Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages
Social media sites such as Twitter and Facebook are user-friendly and widely accessible, giving people an opportunity to make their voices heard. People of all age groups use these sites to share moments of their lives, flooding them with data. Alongside these commendable features, social media also has downsides. The large fraction of hate speech and other offensive and objectionable content online poses a major challenge to societies. Offensive language, such as insulting, hurtful, derogatory, or obscene content directed from one person to another and visible to others, undermines objective discussion. Such language is increasingly found on the web and can lead to the radicalization of debates. Public opinion-forming requires rational critical discourse (Habermas 1984), so objectionable content can pose a threat to democracy. At the same time, open societies need to find an acceptable way to react to such content without imposing rigid censorship regimes.
As a consequence, many social media platforms monitor user posts, which creates a pressing demand for methods that automatically identify suspicious posts. Online communities, social media enterprises, and technology companies have been investing heavily in technology and processes to identify offensive language and prevent abusive behavior on social media. Furthermore, a conversational thread can contain hateful and offensive content that is not apparent from a single comment or reply alone but can be identified given the context of the parent content. Moreover, content on social media is spread across many different languages, including code-mixed languages such as Hinglish (Hindi+English), code-mixed German-English, and many more. It therefore becomes a major responsibility for these sites to identify such hateful content before it spreads to the masses.