
I have a table with two columns: one is an id and the other is a text column. I want to keep only the rows whose text value is in English.

The languages I want to remove are the ones that use a non-Latin alphabet, such as Arabic, Chinese, and Cyrillic. This question was asked around 2012, and I was wondering if there is a newer solution rather than dealing with it in another programming language!

GeoBeez

1 Answer


It is not an easy problem. There are several libraries for language detection out there (e.g. langdetect), but they don't work inside the database, so you'd have to process all records by selecting them out, testing them in another language, and then deleting those that fail the test. Furthermore, the accuracy is not great, and it decreases as the text gets shorter; if your texts are just a couple of words, the accuracy is pretty horrible.
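The select-out, test, delete workflow described above can be sketched as follows. This is a minimal illustration against an in-memory SQLite table with hypothetical column names (`id`, `body`); the `looks_english` check here is a simple non-Latin-script regexp stand-in, and in practice a real detector such as langdetect's `detect()` would be called in its place:

```python
import re
import sqlite3

# In-memory table standing in for the real one (an id plus a text column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [(1, "hello world"), (2, "مرحبا بالعالم"), (3, "你好世界"), (4, "привет")],
)

# Stand-in check: reject rows containing characters from non-Latin scripts.
# A real detector (e.g. langdetect's detect() == "en") could be used instead.
NON_LATIN = re.compile(r"[\u0400-\u04FF\u0600-\u06FF\u4E00-\u9FFF]")

def looks_english(text):
    return NON_LATIN.search(text) is None

# Select the rows out, test them outside the database, delete the failures.
to_delete = [
    row_id
    for row_id, body in conn.execute("SELECT id, body FROM docs")
    if not looks_english(body)
]
conn.executemany("DELETE FROM docs WHERE id = ?", [(i,) for i in to_delete])

kept = list(conn.execute("SELECT id, body FROM docs"))
print(kept)
```

The round trip through application code is the unavoidable part: the detection logic lives outside the database, so only the ids of failing rows go back in for deletion.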

Amadan
  • One easier approach is to use the neural network exposed by the Google Translate API to determine the language. Alternatively, it depends on which character set you use: Unicode, UTF-8 or UTF-16 (https://www.w3schools.com/html/html_charset.asp or https://en.wikipedia.org/wiki/List_of_Unicode_characters). It can be confusing to determine which language a text belongs to. For example, the word "gracias" can be represented using the same characters in ASCII or UTF-8. For non-Latin and non-Cyrillic languages, if you use Unicode, they can be mapped easily. – Manivannan Radhakannan Sep 06 '18 at 09:21
  • Actually, I do not care what languages they are; I just want to keep the English! And I do have Chinese and Arabic in my data, which I want to remove! – GeoBeez Sep 06 '18 at 09:26
  • If they’re all non-Latin, it’s easy. – Amadan Sep 06 '18 at 09:27
  • You believe that for the non-Latin ones I still need to write code? – GeoBeez Sep 06 '18 at 09:34
  • No, a simple regexp should suffice. It's the languages that use a subset of the English alphabet that are the worst to detect. – Amadan Sep 06 '18 at 17:59
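The "simple regexp" from the last comment can be sketched like this. The Unicode ranges below are an assumption covering only the scripts mentioned in the question (Cyrillic, Arabic, CJK); other scripts would need additional ranges:

```python
import re

# Character ranges for the non-Latin scripts named in the question.
# Extend this class for other scripts (Greek, Hebrew, Thai, ...).
NON_LATIN = re.compile(
    "["
    "\u0400-\u04FF"  # Cyrillic
    "\u0600-\u06FF"  # Arabic
    "\u4E00-\u9FFF"  # CJK Unified Ideographs
    "]"
)

def contains_non_latin(text):
    """True if the text contains any character from the listed scripts."""
    return NON_LATIN.search(text) is not None

print(contains_non_latin("plain English text"))  # False
print(contains_non_latin("مرحبا"))               # True
print(contains_non_latin("你好"))                 # True
print(contains_non_latin("привет"))              # True
```

As the comment thread warns, this only catches obviously non-Latin text: French, German, or Spanish share the Latin alphabet and would pass the check unchanged.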