Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Published in The 4th Workshop on Noisy User-Generated Text, Hong Kong, 2018

Recommended citation: S. Mandal and K. Nanmaran. Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. The 4th Workshop on Noisy User-Generated Text, Hong Kong (2018). https://arxiv.org/pdf/1805.08701.pdf

Abstract
Building tools for code-mixed data is rapidly gaining popularity in the NLP research community as such data is rising exponentially on social media. Working with code-mixed data poses several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media scenarios. In this article, we present a novel architecture focusing on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalizing, it can also be utilized for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
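The architecture pairs a character-level Seq2Seq model with Levenshtein distance. As a minimal illustration (not the authors' code), the standard dynamic-programming edit distance that could be used to compare a normalized candidate against dictionary entries can be sketched as:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# e.g. levenshtein("kitten", "sitting") -> 3
```

A normalizer might pick, from a lexicon, the word with the smallest distance to the Seq2Seq output; the exact candidate-selection scheme here is an assumption for illustration.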

Download paper here

@article{mandal2018normalization,
  title={Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model \& Levenshtein Distance},
  author={Mandal, Soumil and Nanmaran, Karthick},
  journal={arXiv preprint arXiv:1805.08701},
  year={2018}
}