
I have a list of comments on YouTube videos in a CSV file; each row contains one comment. The problem is that the comments are in different languages, for example Hindi in Devanagari script, English in Roman script, and Hindi in Roman script (which some people call Hinglish).

Is there a way to extract the rows containing Hindi comments in Roman script for further processing? A regex to detect such a pattern would be a great help.

tripleee
Vinit
    Welcome! I always recommend new users review How to Ask for tips on asking questions in a way that best enables the community to provide guidance. In this case, it would be helpful if you could provide some samples or examples of the input that you are attempting to parse, and a few examples of the desired outcome. Good luck, and happy coding! – Alexander Nied May 24 '21 at 05:06

1 Answer


In the general case, regular expressions are not a good solution to this problem. This is related to "Why is it such a bad idea to parse XML with regex?": a regular expression is excellent for identifying a pattern which doesn't depend on its surroundings, but that's not how human language works. In Indo-Aryan languages, you have "action at a distance" phenomena like sandhi which are hard to model with a regex.

If your target is solely text which is either in English or in Hindi, you can probably find some heuristics which identify them with some limited accuracy, though. For example, observe that Hindi contains digraphs which are unusual in English, such as bh, dh, and aa. Conversely, some digraphs of English are unlikely in Hindi.
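As a rough illustration of the digraph heuristic, here is a minimal Python sketch. Only bh, dh and aa come from the observation above; the other digraphs in both lists, the function names, and the "score above zero" threshold are assumptions made up for this example and would need tuning on real comments. Devanagari comments are screened out first with the Unicode block U+0900 to U+097F.

```python
import csv
import io
import re

# "bh", "dh", "aa" are from the answer; "kh" and "gh" are illustrative guesses.
HINDI_DIGRAPHS = ("bh", "dh", "aa", "kh", "gh")
# Entirely hypothetical list of digraphs assumed to be rarer in romanized Hindi.
ENGLISH_DIGRAPHS = ("wh", "ck", "ou", "ea")

# Any character from the Devanagari Unicode block.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def digraph_score(text):
    """Count Hindi-looking digraphs minus English-looking ones."""
    lowered = text.lower()
    hindi = sum(lowered.count(d) for d in HINDI_DIGRAPHS)
    english = sum(lowered.count(d) for d in ENGLISH_DIGRAPHS)
    return hindi - english


def looks_like_romanized_hindi(text):
    """Crude guess: no Devanagari characters and a positive digraph score."""
    if DEVANAGARI.search(text):
        return False
    return digraph_score(text) > 0


def romanized_hindi_rows(rows):
    """Keep rows whose first column looks like Hindi in Roman script."""
    return [row for row in rows if row and looks_like_romanized_hindi(row[0])]


# Usage with an in-memory stand-in for the CSV file (one comment per row).
sample_csv = "bahut accha gaana hai bhai\nWhat a great sound check\nबहुत अच्छा गाना\n"
matches = romanized_hindi_rows(csv.reader(io.StringIO(sample_csv)))
```

A comment with no listed digraphs at all scores zero and is rejected, which shows how brittle this kind of rule is on short comments.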

However, a better solution with the same basic approach would be to train a simple language identification model which works out a statistical probability based on the characteristics of an entire input text, instead of having a regex make a black-or-white decision based on individual letter pairs. The question "Python: How to determine the language?" has some suggestions for Python modules which do this.
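To make the statistical idea concrete, here is a minimal sketch of a character-bigram naive Bayes classifier in plain Python. This is one common way to build such a model, not necessarily what the modules from the linked question do internally. The class name, the labels, and the six training sentences are invented for illustration; a usable model would need many labeled comments per language.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageId:
    """Naive Bayes over character n-grams with add-one smoothing, uniform priors."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.vocab = set()

    def train(self, samples):
        for label, text in samples:
            grams = char_ngrams(text, self.n)
            self.counts[label].update(grams)
            self.vocab.update(grams)

    def log_scores(self, text):
        """Sum of smoothed log-probabilities of the text's n-grams per label."""
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            scores[label] = sum(
                math.log((counter[g] + 1) / (total + vocab_size)) for g in grams
            )
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training data; far too small for real use.
TRAINING = [
    ("hinglish", "yeh gaana bahut accha hai bhai"),
    ("hinglish", "mujhe yeh video bahut pasand aaya"),
    ("hinglish", "kya baat hai dil khush ho gaya"),
    ("english", "this song is really great"),
    ("english", "i loved the video so much"),
    ("english", "what a wonderful performance thank you"),
]

model = NgramLanguageId(n=2)
model.train(TRAINING)
guess = model.predict("bahut accha gaana hai")
```

Because every bigram in the comment contributes to the score, no single letter pair decides the outcome, which is the advantage over a regex described above.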

tripleee