
I'm using chardet.detect to detect the language of a string, as in one of the solutions suggested here.

My code looks like this:

import chardet

print(chardet.detect('test'.encode()))
print(chardet.detect('בדיקה'.encode()))
print(chardet.detect('тест'.encode()))
print(chardet.detect('テスト'.encode()))

The result I got looks like this:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

The result I expected looks like this:

{'encoding': 'ascii', 'confidence': 1.0, 'language': 'English'}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': 'Hebrew'}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': 'Russian'}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': 'Japanese'}

I prefer chardet as my solution because I am already importing it in my application, and I want to keep the application as slim as possible.

kaki gadol
  • This module is very bad at detecting languages, and often suggests Turkish with a legacy charset for strings that are actually valid UTF-8. At the very least, try decoding as UTF-8 before attempting with chardet. – Tronic May 04 '20 at 16:18
  • Well, I guess you are right. Can you post the comment as an answer, and I'll accept it? – kaki gadol May 05 '20 at 14:14
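The suggestion in the comment above (try UTF-8 before falling back to chardet) could be sketched roughly as follows. The function name `detect_encoding` is hypothetical, and reporting a confidence of 1.0 after a successful strict decode is an assumption of this sketch, not something chardet itself does. chardet is imported lazily, so it is only needed for input that is not valid UTF-8.

```python
def detect_encoding(data: bytes) -> dict:
    """Try a strict UTF-8 decode first; fall back to chardet only on failure.

    Hypothetical helper: returns a dict shaped like chardet.detect()'s result.
    """
    try:
        data.decode("utf-8")  # strict by default: raises on invalid bytes
    except UnicodeDecodeError:
        import chardet  # third-party; only reached for non-UTF-8 input
        return chardet.detect(data)
    # Pure ASCII is also valid UTF-8, so report it separately.
    encoding = "ascii" if data.isascii() else "utf-8"
    # Assumption: treat a successful strict decode as full confidence.
    return {"encoding": encoding, "confidence": 1.0, "language": ""}
```

This only settles the encoding; the `language` field stays empty, so identifying the language still needs a separate tool.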

1 Answer


The chardet module is not very good at detecting either charsets or languages. Of the options listed at "Python: How to determine the language?", I've found pyCLD3 easy to install, and it provides good detection even with fairly short snippets of text, although it is not perfect with single words like your tests:

>>> import cld3

>>> cld3.get_language("test")
LanguagePrediction(language='ko', probability=0.3396911025047302, is_reliable=False, proportion=1.0)

>>> cld3.get_language("בדיקה")
LanguagePrediction(language='iw', probability=0.9995728731155396, is_reliable=True, proportion=1.0)

>>> cld3.get_language("тест")
LanguagePrediction(language='bg', probability=0.9895398616790771, is_reliable=True, proportion=1.0)

>>> cld3.get_language("テスト")
LanguagePrediction(language='ja', probability=1.0, is_reliable=True, proportion=1.0)

That makes three out of four, counting the `bg` result as correct because тест is also a Bulgarian word. The langid module gets all of these right, so it might also be a good option.
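For reference, a minimal sketch of calling langid. It relies on `langid.classify`, which returns a `(language_code, score)` pair; the wrapper name `detect_language` and its `classify` parameter are hypothetical, added only so that the detector can be swapped or stubbed. langid is imported lazily and has to be installed separately (`pip install langid`).

```python
def detect_language(text, classify=None):
    """Return a language-code guess for text.

    classify is any callable returning a (code, score) pair. When it is
    omitted, langid.classify is imported lazily and used instead.
    """
    if classify is None:
        import langid  # third-party: pip install langid
        classify = langid.classify
    code, _score = classify(text)  # the score is ignored in this sketch
    return code
```

With langid installed, `detect_language("בדיקה")` simply forwards to `langid.classify`. As with any detector, longer inputs should give more dependable results than single words.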

Tronic
  • But the first line should return English (en), the second should be Hebrew (he), and the third should be Russian – kaki gadol May 07 '20 at 09:44
  • The first line is misdetected, but the others are correct. Hebrew's old code is `iw` (yes, maybe they should be using the new code `he` instead), and, as stated, the Russian word is also Bulgarian (this is why you need text samples longer than a single word). – Tronic May 07 '20 at 13:43
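The `iw` versus `he` mismatch mentioned above can be smoothed over after detection. The sketch below maps the deprecated ISO 639-1 codes `iw`, `in`, and `ji` to their current replacements and then looks up a display name. The helper names and the small `DISPLAY_NAMES` table are hypothetical and cover only the languages that appear in this thread.

```python
# Deprecated ISO 639-1 codes and their current replacements.
LEGACY_CODES = {"iw": "he", "in": "id", "ji": "yi"}

# Illustrative table only: just the languages discussed in this thread.
DISPLAY_NAMES = {
    "en": "English",
    "he": "Hebrew",
    "ru": "Russian",
    "bg": "Bulgarian",
    "ja": "Japanese",
}


def normalize_language_code(code: str) -> str:
    """Map a deprecated code such as 'iw' to its current form ('he')."""
    code = code.lower()
    return LEGACY_CODES.get(code, code)


def language_name(code: str) -> str:
    """Return a display name, or the normalized code if it is not in the table."""
    normalized = normalize_language_code(code)
    return DISPLAY_NAMES.get(normalized, normalized)
```

For example, `language_name("iw")` yields `Hebrew`, which matches the `language` value the question expected.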