How can I find the language of a given character (offline)? For example, the list `mytext` holds three characters: the first is English, the second is Hindi, and the third is Tamil. I tried to detect the language of each character with the langdetect package, but it produces irrelevant results. How can I get the exact result? (In my case only the third one, "ta" for Tamil, is correct. The other two are wrong.)

from langdetect import detect

mytext = ["B", "उ", "பு"]
# Detect the language of each single-character string
print(detect(mytext[0]))
print(detect(mytext[1]))
print(detect(mytext[2]))

Result

tr
ne
ta
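
Each line of the output above is only the top guess. To see how unsure the detector is about such short input, here is a minimal sketch that keeps the top guess only when its probability clears a cut-off. It assumes langdetect's `detect_langs` returns objects with `.lang` and `.prob` attributes. The names `Candidate`, `pick_language`, `detect_or_none` and the 0.5 threshold are hypothetical, chosen for illustration only.

```python
from collections import namedtuple

# Stand-in with the same .lang/.prob attributes that langdetect's results expose;
# used only so the helper below can be tried without the package installed.
Candidate = namedtuple("Candidate", ["lang", "prob"])


def pick_language(candidates, threshold=0.5):
    """Return the most probable language code, or None if it is below threshold.

    `threshold` is an arbitrary illustrative cut-off, not a langdetect setting.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= threshold else None


def detect_or_none(text, threshold=0.5):
    """Run langdetect on `text` and keep the top guess only if it is confident."""
    # Imported lazily so the rest of the sketch works without langdetect.
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return pick_language(detect_langs(text), threshold)
```

Calling `detect_or_none(mytext[0])` would then return either a language code or `None`, which the caller can treat as "input too short to decide". Note that langdetect can raise an exception for input with no usable features, which this sketch does not handle.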
Barmar
tckraomuqnt
  • langdetect is an AI language detection algorithm. Letters don't have languages, they have scripts. How do you expect any computer to figure out that 'B' is English and not French, German, Spanish, Italian, Rhaeto-Romance or Croatian? – Krateng Jun 13 '22 at 15:56
  • You can't really determine language from a single character, since that character may be used in several different languages. You need whole words or phrases. – Barmar Jun 13 '22 at 15:57
  • You realize many alphabets are shared among many languages, right? There is no single answer to the language of `'B'` (it's as much Turkish as it is English), nor `"उ"` (it is in fact Nepali too). `"பு"` is Tamil, but the same script is used by Saurashtra, Badaga, Irula and Paniya as well, they're just small minority languages that don't have ISO 639-1 two letter language codes AFAICT, so you got lucky. – ShadowRanger Jun 13 '22 at 15:57
  • They're not wrong; Turkish uses "B", and Nepalese uses "उ". – chepner Jun 13 '22 at 15:57
  • Either way, look at the docs at https://github.com/Mimino666/langdetect. You can use `detect_langs` to get a confidence score for each lang returned and either take the best one or make your own choice some other way (e.g. if confidence < 0.5, return an error and ask for longer input). – Anentropic Jun 13 '22 at 16:01
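
Since the comments point out that a single letter has a script rather than a language, a script lookup may be closer to what can actually be determined offline. The following is a rough sketch using only the standard library's `unicodedata` module. It relies on the heuristic that the first word of a character's Unicode name is usually its script, which does not hold for digits, punctuation, or symbols. The function name `script_of` is hypothetical.

```python
import unicodedata


def script_of(text):
    """Return the set of script names found in `text`, taken from Unicode names.

    Heuristic: the first word of a character's Unicode name is usually its
    script, e.g. "LATIN CAPITAL LETTER B" -> "LATIN".
    """
    scripts = set()
    for ch in text:
        # Unnamed code points (e.g. control characters) yield "" and are skipped.
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


mytext = ["B", "उ", "பு"]
for item in mytext:
    print(item, script_of(item))
```

For the three inputs in the question this reports the scripts LATIN, DEVANAGARI and TAMIL. Mapping a script to one language is still a choice the caller has to make, because Latin and Devanagari are each shared by many languages.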

0 Answers