Where can I find such a corpus? I need it to build a token-level (word-level) language detector that distinguishes Hindi from English.
For instance, something like the Hindi Wikipedia written in the Roman alphabet would be quite useful. Short stories, social media posts, tweets, or blogs would also work. Any ideas?
As far as I can tell, existing transliteration engines are not very good. If there is a good one, I would consider using it too.