I have the raw content (text and HTML markup) from thousands of websites. The end goal is exploring topic modeling and clustering. There are many examples of how to filter out non-English words using Python, but unfortunately most do not quite work for the corpus with which I'm working. A few reasons why:
- No geographic info is included in the data set, so cannot just filter by English-speaking countries
- Even if some geographic data can be inferred (e.g., a .in top-level domain), there is still the possibility that the document extracted from that site will contain English
Here's why the following posts don't quite work in my case:
In python, extracting non-English words was a good start, especially because it also removes punctuation, but its output still includes non-English words:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [1]: test_str = Series(['中', 'hello','زندگی','Yo!','かたて く範囲','+44 designer','{{appDetails.title}} {{"TERM','The Pen Company ✒',np.nan,' Shopping Cart:0 Log In/Register'])
In [2]: test_str.str.findall(r'[^\W]+')
Out[2]:
0 [中]
1 [hello]
2 [زندگی]
3 [Yo]
4 [かたて, く範囲]
5 [44, designer]
6 [appDetails, title, TERM]
7 [The, Pen, Company]
8 NaN
9 [Shopping, Cart, 0, Log, In, Register]
dtype: object
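For comparison, a rough variation of my own (not from that post): tightening the pattern to ASCII letters drops the non-Latin scripts, but it still can't distinguish English from other Latin-script languages:
In [3]: test_str.str.findall(r'[a-zA-Z]+')
Out[3]:
0 []
1 [hello]
2 []
3 [Yo]
4 []
5 [designer]
6 [appDetails, title, TERM]
7 [The, Pen, Company]
8 NaN
9 [Shopping, Cart, Log, In, Register]
dtype: object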
Extract non-content English language words string - python is more about using stop-words, which I am already planning on using, for example:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

english_stops = stopwords.words('english')
vect = CountVectorizer(max_features=10000, max_df=.2, stop_words=english_stops)
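To show why stop words alone don't solve this (a small made-up example, continuing from the snippet above): the English stop words are removed, but foreign content words still land in the vocabulary:
docs = ['the pen company', 'etusivu yhteystiedot']  # made-up documents
vect_demo = CountVectorizer(stop_words=english_stops)
vect_demo.fit_transform(docs)
print(vect_demo.get_feature_names_out())
# ['company' 'etusivu' 'pen' 'yhteystiedot'] -- 'the' is dropped, the Finnish words are not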
One possibility here, though: the Python NLTK shows an example of creating a list of all English-language words:
import nltk

# requires the NLTK 'words' corpus (nltk.download('words'))
wordlist = [w for w in nltk.corpus.words.words('en') if w.islower()]
which could then be used to filter tokens (sketched below)... however, given the amount of data, that seems like a sub-optimal option. Similar approaches appear in Removing non-english words from a sentence in python and dropping row containing non-english words in pandas dataframe, but again, using an English dictionary to match word-by-word seems excessive.
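For concreteness, this is the kind of word-by-word check I'd like to avoid (my own sketch, reusing wordlist from above and a set for fast membership tests):
english_vocab = set(wordlist)

def keep_english(tokens):
    # keep only tokens that appear in the English wordlist
    return [t for t in tokens if t.lower() in english_vocab]

keep_english(['hello', 'etusivu', 'designer'])  # 'etusivu' gets dropped
It works, but every single token in the corpus has to be looked up.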
An example function from a notebook demonstrating clustering also lets non-English words through.
import re
import nltk

def tokenize_only(text):
    # first tokenize by sentence, then by word, to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
In this case, Finnish words like Etusivu will pass through the filter.
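For example (the second word is just another Finnish token I added for illustration):
tokenize_only('Etusivu Yhteystiedot')
# ['etusivu', 'yhteystiedot'] -- both contain [a-zA-Z] characters, so both survive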
Ideally, any solution would not involve checking each word in the corpus individually; that being said, I'm open to whatever path others with more experience have taken (including a word-by-word check) :-)