
I am developing a small library automation application and I need to determine whether a word is English or Turkish. An example scenario:

  • The user enters a book title.
  • The application determines whether it is Turkish or English.
  • The language combobox is set to the detected language to help the user fill in the form.

A friend of mine suggested that I "connect to Google Translate and use it", which seems reasonable, but an algorithm that does not rely on an external service or database would suit me better. (I also look for language-specific characters, such as ç, ş, İ for Turkish and w, x for English, to decide.) I am therefore looking for an algorithm that does this job, perhaps based on letter frequencies or something similar. Is anything available in the literature? Thanks in advance. (I use PHP and MySQL, if that matters.)
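
The character check mentioned in parentheses could look roughly like the sketch below. It is written in Python for brevity; the same logic should port to PHP with the multibyte string functions. The function name `guess_by_letters` and the exact letter sets are illustrative assumptions, and a title containing none of these letters gives no answer at all.

```python
# Letters used in Turkish that are not part of the basic English alphabet.
TURKISH_ONLY = set("çğıöşüÇĞİÖŞÜ")
# Letters that are not part of the Turkish alphabet (they show up only in loanwords).
ENGLISH_ONLY = set("qwxQWX")


def guess_by_letters(title):
    """Return 'tr', 'en', or None when the letters give no clear signal."""
    turkish_hits = sum(ch in TURKISH_ONLY for ch in title)
    english_hits = sum(ch in ENGLISH_ONLY for ch in title)
    if turkish_hits > english_hits:
        return "tr"
    if english_hits > turkish_hits:
        return "en"
    return None  # no distinctive letters, or a tie


print(guess_by_letters("Suç ve Ceza"))
print(guess_by_letters("War and Peace"))
```

Many titles contain no distinctive letter at all, which is why this check alone is not enough.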

Barış Akkurt
  • 4
    See http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php. You can also check http://wiki.apache.org/solr/LanguageDetection. Solr can give you the language with a probability (for example, this sentence is 90% English and 10% Turkish). – fsw Apr 07 '13 at 21:09
  • 3
    what about the words that are both? –  Apr 07 '13 at 21:09
  • 2
    Thanks for all the answers. Dagon, I am not expecting a 100% accurate algorithm. frenchie, this is a hobby project and I think a feature like this would be nice to have. fsw, your links are suitable for me; I would accept your answer if you posted it as an answer rather than a comment. – Barış Akkurt Apr 07 '13 at 21:17

2 Answers


If the sample you're testing is that small (a single word or phrase), then simple heuristics like letter frequency aren't going to be very useful: the English phrase "Jazz Quizzes", for example, would probably fit the profile of many other languages more readily than that of English.

You might be able to use the frequency of bigrams and trigrams (2- and 3-letter combinations), since English and Turkish are sufficiently unrelated that some combinations occur in only one of them.
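
As a rough illustration of the trigram idea, here is a minimal Python sketch (the logic is not language-specific and could be ported to PHP). The two sample sentences, the `PROFILES` table, and the function names are all hypothetical; profiles built from a couple of sentences are far too small for real use and only show the mechanics of counting combinations that occur in one language's profile but not the other's.

```python
# Tiny illustrative training samples (assumed); real profiles need far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the child "
          "reads a book about the history of the world",
    "tr": "bir çocuk kitap okuyor ve sonra evde annesi ile birlikte "
          "dünyanın tarihi hakkında bir hikaye dinliyor",
}


def ngrams(text, n=3):
    """Return all n-letter windows of the text, padded with word-boundary spaces."""
    # Note: str.lower() is not Turkish-aware (it does not map I to ı).
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# One set of known trigrams per language.
PROFILES = {lang: set(ngrams(sample)) for lang, sample in SAMPLES.items()}


def guess_by_ngrams(title):
    """Return 'en', 'tr', or None, counting trigrams seen in only one profile."""
    grams = ngrams(title)
    en_only = sum(1 for g in grams if g in PROFILES["en"] and g not in PROFILES["tr"])
    tr_only = sum(1 for g in grams if g in PROFILES["tr"] and g not in PROFILES["en"])
    if en_only == tr_only:
        return None  # no evidence either way
    return "en" if en_only > tr_only else "tr"


print(guess_by_ngrams("The history of the world"))
print(guess_by_ngrams("Bir çocuk kitap okuyor"))
```

With profiles built from a large body of text per language, the same counting could be weighted by how often each trigram occurs instead of treating every match equally.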

More likely, however, you are going to have to use a database of actual words from the two languages. In that case, you are probably best off using a third-party API or database, rather than going to all the effort of building your own corpora, implementing the statistical algorithms, etc.
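
For illustration only, a word-list lookup could be sketched in Python as follows. The two word sets and the name `guess_by_words` are hypothetical stand-ins; a usable version would load complete word lists from a database table or a third-party source rather than hard-coding a handful of words.

```python
import re

# Hypothetical, deliberately tiny word lists standing in for real dictionaries.
ENGLISH_WORDS = {"the", "of", "and", "a", "in", "war", "peace", "history"}
TURKISH_WORDS = {"ve", "bir", "ile", "savaş", "barış", "tarih", "kitap"}


def guess_by_words(title):
    """Return 'en', 'tr', or None depending on which list matches more words."""
    # \w is Unicode-aware for str patterns, so letters like ş and ı are kept.
    words = re.findall(r"\w+", title.lower())
    english_hits = sum(w in ENGLISH_WORDS for w in words)
    turkish_hits = sum(w in TURKISH_WORDS for w in words)
    if english_hits == turkish_hits:
        return None  # unknown words only, or a tie
    return "en" if english_hits > turkish_hits else "tr"


print(guess_by_words("War and Peace"))
print(guess_by_words("Savaş ve Barış"))
```

Words that exist in both languages score for both lists, so they cancel out and the decision falls to the remaining words in the title.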

IMSoP

As per my comment, please check "Detect language from string in PHP" (http://stackoverflow.com/questions/1441562/detect-language-from-string-in-php)

or:

http://wiki.apache.org/solr/LanguageDetection

Solr can give you the language with a probability (for example, this sentence is 90% English and 10% Turkish).

fsw