7

I know this question has been asked many times, but I still couldn't solve it with the available solutions. I'm hoping for further ideas or approaches for detecting whether a sentence is English in Python. The solutions I found:

  • Language Detector (in Ruby, not Python :/)
  • Google Translate API v2 (no longer free; it costs 20 bucks a month, and I'm doing this project for academic purposes. Courtesy limit: 0 characters/day)
  • Language identification for Python (source code not found; see the automatic-language-identification link below)
  • Enchant (apparently not available for Python 2.7? I'm new to Python, so is there a guide? I suspect this is the one I need)
  • WordNet from NLTK (I have no idea why "wordnet.synsets" is missing and only "wordnet.Synset" is available. The sample code in that solution doesn't work for me either T_T, probably a versioning issue again?)
  • Store English words in a list and check whether each word exists (a fairly bad approach when the sentences come from Twitter, for the obvious reasons :P)

WORKING SOLUTION

Finally, after a series of attempts, the following solutions work (as alternatives to the list above):

  • Wiktionary API (using urllib2, with simplejson to parse the response. If the page key is -1, the word doesn't exist; otherwise I treat it as English. For Twitter, you first have to preprocess each word to strip special characters such as @#,?!. For how to find the key, see the "Simplejson and random key value" reference.)
  • Answer from Dogukan Tufekci (accepted). Weakness: for sentences shorter than 20 characters, PyEnchant has to be installed or it returns UNKNOWN. Since PyEnchant doesn't support Python 2.7, I couldn't install it, so this doesn't work for sentences under 20 characters.
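
The Wiktionary lookup described above can be sketched roughly as follows. This is a minimal sketch, not the original code: it uses Python 3's urllib.request and json in place of urllib2 and simplejson, and the helper names (clean_word, build_query_url, page_exists, lookup) and the User-Agent string are made up for illustration. It relies on the MediaWiki query API reporting a missing page under the key "-1". Keep in mind that English Wiktionary also has entries for words from other languages, so a hit only shows that the word exists there.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wiktionary.org/w/api.php"


def clean_word(word):
    """Strip the Twitter-style special characters mentioned above."""
    return word.strip("@#,?!.").lower()


def build_query_url(word):
    """Build a MediaWiki 'query' URL asking about a single page title."""
    params = {"action": "query", "titles": word, "format": "json"}
    return API_URL + "?" + urlencode(params)


def page_exists(response):
    """Return True unless the API reported the page under the key '-1'."""
    pages = response.get("query", {}).get("pages", {})
    return bool(pages) and "-1" not in pages


def lookup(word):
    """Ask Wiktionary whether a page for `word` exists (needs network access)."""
    request = Request(
        build_query_url(clean_word(word)),
        headers={"User-Agent": "language-check-sketch/0.1"},  # hypothetical name
    )
    with urlopen(request, timeout=10) as reply:
        return page_exists(json.loads(reply.read().decode("utf-8")))
```

Calling lookup("hello") performs one HTTP request per word, so for a large batch of tweets it is worth caching the results.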

References

1myb
  • Interesting question. An improvement over storing words in a list would be to store them in a set or dictionary. A membership test on a list is O(n), whereas on a set or dictionary it is O(1) on average. – Octipi Mar 07 '13 at 00:48
  • Don't put the solution in the question, instead post it as an answer. Answering your own question if you have the answer is encouraged – Tim Jan 14 '16 at 15:17
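
Octipi's suggestion can be sketched like this. It is only an illustration under stated assumptions: the tiny SAMPLE_VOCABULARY stands in for a real English word list, and the names tokenize and english_ratio are hypothetical. Mentions and hashtags are dropped before the lookup because the sentences come from Twitter.

```python
import re

# Hypothetical stand-in for a real English word list loaded from a file.
SAMPLE_VOCABULARY = {"the", "cat", "is", "on", "mat", "a", "dog"}


def tokenize(text):
    """Lowercase the text, drop @mentions and #hashtags, keep alphabetic words."""
    text = re.sub(r"[@#]\w+", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def english_ratio(text, vocabulary):
    """Fraction of tokens found in `vocabulary` (a set, so each lookup is O(1) on average)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)
```

A sentence could then be treated as English when the ratio exceeds some threshold, which would have to be tuned on real tweets.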

2 Answers

8

You can try the guess_language library, which I found through Miguel Grinberg's The Flask Mega-Tutorial. It looks like it supports Python 2 and 3, so it should be OK.

Dogukan Tufekci
  • Thanks ;) I couldn't find the documentation recently, so I had ignored this library. By the way, do you have any clue how to fix this? The import gives no error, but when I call guess_language("My Sentence"), it returns the following: Traceback (most recent call last): File "", line 1, in TypeError: 'module' object is not callable – 1myb Mar 07 '13 at 01:33
  • Your import should be: from guess_language import guessLanguage, and your call should be guessLanguage('My sentence'). You are calling the module, which is wrong. The TypeError is indeed helpful if you try to understand what it says. In this case, it says you are calling a 'module' object. – Dogukan Tufekci Mar 07 '13 at 01:39
  • Tufekci, thanks a lot ;) The documentation is annoying -.- – 1myb Mar 07 '13 at 01:43
  • @DogukanTufekci thanks so much! The documentation wasn't helping... But now it's working :-) – bknopper Dec 19 '14 at 16:32
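
Putting the import fix from the comments together, a minimal sketch might look like the following. The wrapper name is_english and its detect parameter are made up for illustration, and the code assumes guessLanguage returns the code 'en' for English text (the question mentions 'UNKNOWN' for text it cannot classify). The import happens inside the function so the wrapper can also be tried with any other detector.

```python
def is_english(text, detect=None):
    """Return True when the detector labels `text` as English."""
    if detect is None:
        # Import the function, not the module: calling the module itself is
        # what raised "TypeError: 'module' object is not callable".
        from guess_language import guessLanguage
        detect = guessLanguage
    # Assumption: the detector returns the language code 'en' for English.
    return detect(text) == "en"
```

With the library installed, is_english('My sentence') goes through guessLanguage. Very short input may come back as 'UNKNOWN', in which case the wrapper returns False.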
1

You might be able to use hidden Markov models to detect languages, since each language has its own characteristics.

Arafangion
  • May I have a reference link, please? ;) Thanks – 1myb Mar 07 '13 at 01:34
  • http://en.wikipedia.org/wiki/Hidden_Markov_model Sorry for being terse, but basically the probability of a particular sequence of bytes depends on the language. In English, "hello" is a more likely sequence of bytes than one that rarely occurs in the language, such as "encontrar". The difference might be slight for individual words, but if you have a phrase, you'd be able to get a more conclusive result. – Arafangion Mar 07 '13 at 14:57
  • Frankly, I'd just go with Dogukan's answer. – Arafangion Mar 07 '13 at 15:03
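
The idea that some character sequences are more likely in one language than another can be illustrated with something simpler than a hidden Markov model: a plain character-bigram Markov chain with add-one smoothing. This is only a sketch, not Arafangion's method. The two training strings are tiny made-up samples, and all names (ALPHABET, train, log_likelihood, classify) are hypothetical. A usable model would need far more training text per language.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny made-up training samples; real models need much more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is a simple "
          "english sentence with the usual words that people write every day",
    "es": "el rapido zorro marron salta sobre el perro perezoso y esta es una "
          "frase sencilla en espanol con las palabras que la gente escribe cada dia",
}


def normalize(text):
    """Lowercase, map anything outside ALPHABET to a space, collapse spaces."""
    cleaned = "".join(ch if ch in ALPHABET else " " for ch in text.lower())
    return " ".join(cleaned.split())


def train(text):
    """Count character bigrams and how often each character starts a bigram."""
    text = normalize(text)
    return Counter(zip(text, text[1:])), Counter(text[:-1])


def log_likelihood(text, model):
    """Sum of log P(next char | current char) with add-one smoothing."""
    pair_counts, first_counts = model
    text = normalize(text)
    size = len(ALPHABET)
    return sum(
        math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + size))
        for a, b in zip(text, text[1:])
    )


def classify(text, models):
    """Pick the language whose model gives the text the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(text, models[lang]))


MODELS = {lang: train(sample) for lang, sample in SAMPLES.items()}
```

As the comment above notes, the difference between languages can be slight for a single word, so the result becomes more reliable as the input gets longer.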