Where can I find such a corpus? I require this to build a language detector between Hindi and English at the token (word) level.

For instance, something like the Hindi Wikipedia in the Roman alphabet would be quite useful. Or short stories, social media posts or tweets, or blogs? Any ideas?

As far as I can tell, existing transliteration engines are not very good. If there is a good one, I would consider using that too.

piedpiper
  • Roll your own transliteration utility, following for example the rules of the [International Alphabet of Sanskrit Transliteration](https://en.wikipedia.org/wiki/International_Alphabet_of_Sanskrit_Transliteration). AFAIK, Indic language texts are just about never written with the Latin alphabet; transliteration is used only for names, and isolated words or short fragments in books written in a language which uses a non-Indic alphabet. – AlexP Feb 08 '17 at 03:26
  • In the last decade, "Romanagari" (Roman-script Hindi) has come to be used ubiquitously in instant messaging and social media. However, it is true that there are no books or other structured texts written this way. Your suggestion is indeed my baseline, but it's not good enough: its output does not resemble the way people actually write Hindi in Roman script. – piedpiper Feb 08 '17 at 03:33
  • See "[Romanagari Detection in Twitter](http://home.iitk.ac.in/~hrishirt/cs671/project/report.pdf)" by Hrishikesh Terdalkar and Shubhangi Agarwal, IIT Kanpur (2015); maybe the section on datasets can help. E-mail addresses of the authors are given on a [poster](http://home.iitk.ac.in/~hrishirt/cs671/project/poster.pdf). – AlexP Feb 08 '17 at 03:45
  • @ashu did you find the corpus? I'm looking for it too :) – Arshad Ansari Apr 19 '17 at 09:02
  • @ArshadAnsari one idea is to look for blogs which have Hindi articles written in Roman script. Can't place some links I'd found, will add here when I get them. – piedpiper Aug 03 '17 at 16:32
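
The rule-based baseline discussed in the comments above can be sketched roughly as follows. This is a minimal illustration, not full IAST: the mapping tables are a small, ASCII-only subset chosen here as an assumption (IAST proper uses diacritics and covers many more characters), and `romanize` is a hypothetical helper name. It only handles the inherent vowel, vowel signs, and the virama.

```python
# Simplified Devanagari -> Roman mapping (assumed subset, NOT full IAST).
CONSONANTS = {
    "क": "k", "ख": "kh", "ग": "g", "घ": "gh", "च": "ch", "ज": "j",
    "त": "t", "द": "d", "न": "n", "प": "p", "ब": "b", "म": "m",
    "य": "y", "र": "r", "ल": "l", "व": "v", "स": "s", "ह": "h",
}
VOWELS = {"अ": "a", "आ": "aa", "इ": "i", "ई": "ee", "उ": "u", "ऊ": "oo", "ए": "e", "ओ": "o"}
MATRAS = {"ा": "aa", "ि": "i", "ी": "ee", "ु": "u", "ू": "oo", "े": "e", "ो": "o"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant


def romanize(text):
    """Transliterate Devanagari to Roman script with naive character rules."""
    out = []
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # A consonant carries an inherent "a" unless a vowel sign or virama follows.
            if nxt not in MATRAS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        elif ch == VIRAMA:
            continue  # already accounted for when handling the consonant
        else:
            out.append(ch)  # pass through anything not in the tables
    return "".join(out)


print(romanize("नमस्ते"))
```

A sketch like this always emits the inherent vowel, including at the end of a word where Hindi speakers usually drop it, and it produces a single spelling per word, whereas real Roman-script Hindi varies from writer to writer. That is likely part of why such a baseline does not resemble real-world usage closely.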

1 Answer

Google Translate shows the transliterated result when you enter text with the 'Text' option selected at https://translate.google.co.in/. Sample.

But there's a catch: it has a 5,000-character limit. Surprisingly, Google does not provide this feature when translating anywhere else (Google Docs, Gmail, etc.). Please let me know if you find a more feasible and robust solution to your problem.
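
One way to work within the character limit is to split the source text into pieces that each fit under it before pasting them in. The sketch below is only an illustration of that idea: `chunk_text` is a hypothetical helper, the default of 5000 comes from the limit mentioned above, and it makes no calls to any Google service.

```python
def chunk_text(text, limit=5000):
    """Split text into chunks of at most `limit` characters, breaking on whitespace."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # A single word longer than the limit has to be hard-split.
        while len(word) > limit:
            if current:
                chunks.append(" ".join(current))
                current, length = [], 0
            chunks.append(word[:limit])
            word = word[limit:]
        # Account for the joining space when the chunk already has words.
        extra = len(word) + (1 if current else 0)
        if length + extra > limit:
            chunks.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


pieces = chunk_text("yah ek udaharan hai " * 1000)
print(len(pieces), max(len(p) for p in pieces))
```

Note that this normalizes all whitespace to single spaces, so line breaks in the original text are not preserved; splitting on sentence boundaries instead may give better results if sentence context matters.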