14

I have a form that lets users enter text snippets. How can I figure out the language of the entered text?

Specifically these languages for now:

Arabic: هذه هي بعض النصوص العربية

Chinese: 这是一些阿拉伯文字

Japanese: これは、いくつかのアラビア語のテキストです

[Edit] The detection has to work on text that is retrieved via an API too (no browser involved).

philfreo
Yeti
  • Possible duplicate of [Detect language from string in PHP](http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php) – cweiske Feb 28 '17 at 13:57
  • See also: [How to detect language](https://stackoverflow.com/q/3173005/562769) – Martin Thoma Aug 13 '17 at 13:09

5 Answers

8

You can figure out whether the characters are from the Arabic, Chinese, or Japanese sections of the Unicode map.

If you look at the list on Wikipedia, you'll see that each of those languages has many sections of the map. But you're not doing translation, so you don't need to worry about every last glyph.

For example, your Chinese text begins (in hex) 0x8FD9 0x662F 0x4E00, and those are all in the "CJK Unified Ideographs" block, which covers Chinese characters (Japanese kanji live in the same block). Here are a few ranges to get you started:

Arabic (0600–06FF)

Japanese

  • Hiragana (3040–309F)
  • Katakana (30A0–30FF)
  • Kanbun (3190–319F)

Chinese

  • CJK Unified Ideographs (4E00–9FFF)

(I got the hex for your Chinese by using a Chinese to Unicode Converter.)
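
As a rough sketch of how the ranges above could be applied, the Python below counts how many characters of a string fall into each range and guesses from that. The names `RANGES`, `count_scripts`, and `guess_language` are made up for this illustration, and the "any kana means Japanese" rule is an assumption added here because Japanese text also uses CJK ideographs. It detects characters, not language, so it cannot tell apart languages that share a script.

```python
# Code point ranges copied from the list above (inclusive on both ends).
RANGES = {
    "Arabic": [(0x0600, 0x06FF)],
    "Japanese": [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3190, 0x319F)],
    "Chinese": [(0x4E00, 0x9FFF)],
}


def count_scripts(text):
    """Count how many characters of `text` fall into each range group."""
    counts = {name: 0 for name in RANGES}
    for ch in text:
        cp = ord(ch)
        for name, spans in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in spans):
                counts[name] += 1
                break
    return counts


def guess_language(text):
    """Return "Arabic", "Chinese", "Japanese", or None if nothing matched."""
    counts = count_scripts(text)
    # Assumption: kana only appears in Japanese, while CJK ideographs are
    # shared by Chinese and Japanese, so any kana wins over ideographs.
    if counts["Japanese"] > 0:
        return "Japanese"
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


if __name__ == "__main__":
    for sample in ("هذه هي بعض النصوص العربية",
                   "这是一些阿拉伯文字",
                   "これは、いくつかのアラビア語のテキストです"):
        print(guess_language(sample), [hex(ord(c)) for c in sample[:3]])
```

Because this works on plain strings, it applies equally to form input and to text fetched from an API.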

egrunin
  • 1
    Does this method distinguish Persian and Arabic too? – xkcd Sep 26 '11 at 08:52
  • 1
    This doesn't detect *language*, it detects *characters*. So unless Persian contains characters that are not used in Arabic, this will not do that. – egrunin Sep 29 '11 at 17:48
  • 2
    1. In "CJK Unified", C is for Chinese and J is for Japanese. That means these characters may appear in both Chinese and Japanese. 2. CJK characters cover more Unicode code points than what is described here. – tsh Apr 01 '20 at 07:39
  • @tsh True, I'd forgotten about the overlap between Japanese and Chinese. – egrunin Apr 14 '20 at 21:23
2

You could use the Google Ajax API for detecting the language of a snippet of text.

ChristopheD
1

Presumably the point of guessing the user's language is to display responses in the proper language. What about examining the browser's settings for preferred languages? You can obtain them from the HTTP header Accept-Language. See section 14.4 of RFC 2616.
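
For illustration only, here is a minimal Python sketch of reading that header. The function name `parse_accept_language` is hypothetical, and the parsing is simplified: it only understands comma-separated tags with an optional `q=` weight and skips the finer rules of the specification. It only helps when a browser sends the request.

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language value, best first."""
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        params = params.strip()
        q = 1.0  # a tag without an explicit weight counts as q=1
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # assumption: treat an unreadable weight as "not acceptable"
        entries.append((tag.strip(), q))
    # sort() is stable, so tags with equal weight keep their header order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return [tag for tag, q in entries if q > 0]


if __name__ == "__main__":
    print(parse_accept_language("ja;q=0.5, zh-CN, ar;q=0.9"))
```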

wallyk
  • This sounds like a good solution, but I forgot to mention the technique has to work on text retrieved via API too. – Yeti May 02 '10 at 07:08
0

I'm exploring the same thing, for server-side. Thus far I have found https://code.google.com/p/language-detection/. Hope this helps someone.

JRun
0

You could use https://detectlanguage.com/, which is a web service built around CLD2.

Martin Thoma