I have the raw content (text and HTML markup) from thousands of websites. The end goal is exploring topic modeling and clustering. There are many examples of how to filter out non-English words using Python, but unfortunately most do not quite work for the corpus with which I'm working. A few reasons why:
- No geographic info is included in the data set, so cannot just filter by English-speaking countries
- Even if some geographic data can be inferred (e.g., a .in top-level domain), there is still the possibility that the document extracted from that site will contain English
Here's why the following posts don't quite work in my case:
In python, extracting non-English words was a good start, especially because it also removes punctuation, but its output still includes non-English words:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [1]: test_str = Series(['中', 'hello','زندگی','Yo!','かたて く範囲','+44 designer','{{appDetails.title}} {{"TERM','The Pen Company ✒',np.nan,' Shopping Cart:0 Log In/Register'])
In [2]: test_str.str.findall(r'[^\W]+')
Out[2]:
0 [中]
1 [hello]
2 [زندگی]
3 [Yo]
4 [かたて, く範囲]
5 [44, designer]
6 [appDetails, title, TERM]
7 [The, Pen, Company]
8 NaN
9 [Shopping, Cart, 0, Log, In, Register]
dtype: object
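For comparison, a rough variation of my own (not from that post): tightening the pattern to ASCII letters drops the non-Latin scripts, but it still can't distinguish English from other Latin-script languages:
In [3]: test_str.str.findall(r'[a-zA-Z]+')
Out[3]:
0 []
1 [hello]
2 []
3 [Yo]
4 []
5 [designer]
6 [appDetails, title, TERM]
7 [The, Pen, Company]
8 NaN
9 [Shopping, Cart, Log, In, Register]
dtype: object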
Extract non-content English language words string - python is more about using stop-words, which I am already planning on using, for example:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

english_stops = stopwords.words('english')
vect = CountVectorizer(max_features=10000, max_df=.2, stop_words=english_stops)
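To show why stop words alone don't solve this (a small made-up example, continuing from the snippet above): the English stop words are removed, but foreign content words still land in the vocabulary:
docs = ['the pen company', 'etusivu yhteystiedot']  # made-up documents
vect_demo = CountVectorizer(stop_words=english_stops)
vect_demo.fit_transform(docs)
print(vect_demo.get_feature_names_out())
# ['company' 'etusivu' 'pen' 'yhteystiedot'] -- 'the' is dropped, the Finnish words are not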
One possibility here, though: the Python NLTK shows an example of creating a list of all English-language words:
import nltk

# requires the NLTK 'words' corpus (nltk.download('words'))
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
which could then be used to filter tokens (sketched below)... however, given the amount of data, that seems like a sub-optimal option. Similar approaches appear in Removing non-english words from a sentence in python and dropping row containing non-english words in pandas dataframe, but again, using an English dictionary to match word-by-word seems excessive.
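For concreteness, this is the kind of word-by-word check I'd like to avoid (my own sketch, reusing wordlist from above and a set for fast membership tests):
english_vocab = set(wordlist)

def keep_english(tokens):
    # keep only tokens that appear in the English wordlist
    return [t for t in tokens if t.lower() in english_vocab]

keep_english(['hello', 'etusivu', 'designer'])  # 'etusivu' gets dropped
It works, but every single token in the corpus has to be looked up.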
An example function from a notebook demonstrating clustering also lets non-English words through.
import re
import nltk

def tokenize_only(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
In this case, Finnish words like Etusivu will pass through the filter.
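For example (the second word is just another Finnish token I added for illustration):
tokenize_only('Etusivu Yhteystiedot')
# ['etusivu', 'yhteystiedot'] -- both contain [a-zA-Z] characters, so both survive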
Ideally, any solution would not involve checking each word in the corpus individually; that being said, I'm open to whatever path others with more experience have taken (including a word-by-word check) :-)