Hate Speech and Offensive Content Identification in Indo-European Languages
This is the call for participation in the Shared Task on Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC). We invite everyone from academia and industry to take part.
There has been significant work in several languages, in particular for English. However, research on this recent and relevant topic is lacking for most other languages. This track intends to develop data and evaluation resources for several languages. The objectives are to stimulate research for these languages and to assess the quality of hate speech detection technology in them. In the long run, the track aims to support researchers in developing robust technology that can cope with multilingual data, and transfer learning approaches that can exploit training data across languages. For future editions, we envision the integration of further languages.
The dataset will be created from Twitter and Facebook and distributed in tab-separated format. Participants are allowed to use external resources and other datasets for this task. The dataset will be prepared in three languages (German, English, and code-mixed Hindi).
The training corpus contains approximately 8,000 posts for each language.
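As a rough illustration of working with a tab-separated release, the following sketch reads rows into dictionaries. The column names (`text_id`, `text`, `label`) and the sample rows are assumptions made for this example; the actual layout of the distributed files may differ.

```python
import csv
import io

# Hypothetical column layout; the released files may use other columns.
ASSUMED_COLUMNS = ("text_id", "text", "label")


def read_posts(handle):
    """Parse tab-separated rows into dicts keyed by the assumed column names."""
    # QUOTE_NONE keeps quotation marks inside social media posts as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    posts = []
    for row in reader:
        if len(row) != len(ASSUMED_COLUMNS):
            continue  # skip empty or malformed lines
        posts.append(dict(zip(ASSUMED_COLUMNS, row)))
    return posts


# Tiny in-memory sample standing in for a real data file.
sample = io.StringIO(
    "id_1\tHave a nice day\tNOT\n"
    "\n"
    'id_2\tAn example "quoted" post\tHOF\n'
)
posts = read_posts(sample)
```

For a file on disk, the same function could be called with `open(path, encoding="utf-8", newline="")`, where `path` is whatever location the data was saved to.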
Our objective behind the HASOC shared task is to leverage the synergies of both forums. The HASOC shared task is offered in three sub-tasks. Participants in this year’s shared task can choose to participate in one, two, or all of the sub-tasks.
In our annotation, we label a post as HOF if it contains any form of non-acceptable language such as hate speech, aggression, or profanity; otherwise it is labeled NOT.
HATE SPEECH: Ascribing negative attributes or deficiencies to groups of individuals because they are members of a group (e.g. all poor people are stupid). Hateful comments toward groups because of political opinion, sexual orientation, gender, social status, health condition or similar.
OFFENSIVE: Posts which degrade, dehumanize, or insult an individual, or threaten with violent acts, are categorized into the OFFENSIVE category.
PROFANITY: Unacceptable language in the absence of insults and abuse. This typically concerns the usage of swearwords (Scheiße, Fuck etc.) and cursing (Zur Hölle! Verdammt! etc.). Such posts are categorized into the PROFANITY category.
We expect most posts to be OTHER, some to be HATE, and the other two categories to be less frequent. Dubious cases that are difficult to decide even for humans should be left out.
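The relation between the fine-grained categories and the binary HOF / NOT labels can be sketched as a simple mapping. The label strings below are assumptions derived from the category names in the text, not the official label codes, and the sample list is invented for illustration.

```python
from collections import Counter

# Assumed label strings; the official annotation codes may differ.
FINE_TO_BINARY = {
    "HATE": "HOF",
    "OFFENSIVE": "HOF",
    "PROFANITY": "HOF",
    "OTHER": "NOT",
}


def to_binary(fine_label):
    """Collapse a fine-grained category into the binary HOF / NOT scheme."""
    try:
        return FINE_TO_BINARY[fine_label.upper()]
    except KeyError:
        raise ValueError(f"unknown label: {fine_label!r}")


# Invented sample annotations, with OTHER as the most frequent class.
labels = ["OTHER", "HATE", "OTHER", "PROFANITY", "OFFENSIVE", "OTHER"]
binary = [to_binary(label) for label in labels]
distribution = Counter(binary)
```

Raising an error on unknown labels, rather than silently mapping them to NOT, makes annotation typos visible early.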
The multilingual HASOC corpus will be sampled from Facebook and Twitter and distributed in tab-separated format. Participants are allowed to use external resources and other datasets for this task. The dataset will be prepared in three languages (German, English, and code-mixed Hindi).
The training corpus contains approximately 8,000 posts for each language, and the test set approximately 1,000 posts for each language. Classification systems in all tasks will be evaluated using either the macro-averaged F1-score or the weighted F1-score.
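To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. It is not the official evaluation script: macro-averaging takes the unweighted mean of the per-class F1 scores, while weighted averaging weights each class by its number of gold instances. The function names and the toy labels are illustrative assumptions.

```python
from collections import Counter


def per_class_f1(gold, predicted):
    """Return a dict mapping each label to its F1 score."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(gold, predicted)
    return sum(scores.values()) / len(scores)


def weighted_f1(gold, predicted):
    """Mean of per-class F1 weighted by each class's gold support."""
    scores = per_class_f1(gold, predicted)
    support = Counter(gold)
    return sum(scores[label] * support[label] for label in scores) / len(gold)


# Toy example with an imbalanced label distribution.
gold = ["NOT", "NOT", "NOT", "HOF"]
predicted = ["NOT", "NOT", "HOF", "HOF"]
macro = macro_f1(gold, predicted)
weighted = weighted_f1(gold, predicted)
```

Because most posts are expected to fall into the majority class, the macro average rewards systems that also perform well on the rarer classes, whereas the weighted average is dominated by the frequent class.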