2

I have written code that sends queries to Google and returns the results. I extract the snippets (summaries) from these results for further processing. However, sometimes there are non-English words in these snippets which I don't want. For example:

/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/ 

I only want the word "unstressed" in this sentence. How can I do that? Thanks

Hossein
  • That block of text doesn't make any sense. – Glenn Maynard Oct 27 '10 at 09:17
  • Do you want actual English dictionary words, or just words with only ASCII characters in them (even if they're, say, proper nouns like a name or place, or misspelt English words)? – detly Oct 27 '10 at 09:18
  • @Glenn: those are words in unicode format – Hossein Oct 27 '10 at 09:20
  • @detly: the ASCII characters are good, but I tried to encode them to ASCII and it didn't work... – Hossein Oct 27 '10 at 09:20
  • @detly: English words can contain non-ASCII characters (piñata, étude); you probably mean non-Latin characters. – Glenn Maynard Oct 27 '10 at 09:25
  • @Hossein: Those aren't in Unicode, they're escaped and unreadable. Paste Unicode text directly, so it's readable. – Glenn Maynard Oct 27 '10 at 09:28
  • @detly: Actually, his text--once "decrypted", heh--contains Latin letters in the other words ("wɛn"), so that won't work, either. The only option is heuristic analysis tools, and that will probably never be very reliable... – Glenn Maynard Oct 27 '10 at 09:35
  • @Hossein it might be useful to present some information on the goal you are trying to achieve. Do you need a perfect approach or can you live with optimal results. Or even with less? – bastijn Oct 27 '10 at 09:43
  • @Glenn Maynard - non-ASCII in English words? Please, I'm Australian — we don't even have 'q' over here. – detly Oct 27 '10 at 14:23

3 Answers

4

PyEnchant might be a simple option for you. I do not know about its speed, but you can do things like:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>

A tutorial is found here; it also has options for returning suggestions, which you can use again for another query or something. In addition, you can check whether your result is in Latin-1 (is_utf8() exists; I do not know whether an is_latin-1() equivalent does. Maybe use something like Enca, which detects the encoding of text files based on knowledge of their language.)
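The encoding check mentioned above can be sketched without any extra library by simply trying to encode each word (a rough Python 3 sketch; the snippet string below is the example from the question):

```python
def latin1_words(text):
    """Keep only the words that survive a Latin-1 encode."""
    kept = []
    for word in text.split():
        try:
            word.encode("latin-1")  # raises UnicodeEncodeError for IPA symbols etc.
            kept.append(word)
        except UnicodeEncodeError:
            pass
    return kept

snippet = "\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n"
print(latin1_words(snippet))  # prints ['unstressed']
```

Note that, as Glenn points out in the comments, this is not a real test of English-ness: it only rejects words containing characters outside Latin-1 ("wɛn" happens to be caught because ɛ is outside that range), and it keeps accented words like "piñata" or "étude".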

bastijn
  • will this allow punctuation and special chars as well? The idea is to check for non-English texts – MANU Dec 03 '16 at 15:06
1

You can compare the words you receive with a dictionary of English words, for example /usr/share/dict/words on a BSD system.

I would guess that Google's results are for the most part grammatically correct, but if not, you might have to look into stemming in order to match against your dictionary.

knutin
  • Putting aside the fact that Google's results come from the Internet and are therefore grammatically dubious at best, you're going to have to do stemming anyway. No words file is going to contain every inflection of every word. – Glenn Maynard Oct 27 '10 at 09:38
  • The question is if that is required. Do we need a 100% accurate result or can we live with an optimal one. Using dictionary + stemming may not be perfect, but may very well be good enough for the TS. – bastijn Oct 27 '10 at 09:41
1

You can use PyWordNet, a Python interface to WordNet. Just split your sentence on whitespace and check whether each word is in the dictionary.

Klark