
I'm building a text classifier that will classify text into topics.

In the first phase of my program, as part of cleaning the data, I remove all non-English words. For this I'm using the nltk.corpus.words.words() corpus. The problem with this corpus is that it doesn't include 'modern' English words such as Facebook and Instagram, so they get removed along with the genuinely non-English ones. Does anybody know of another, more 'modern' corpus that I could replace the present one with, or union with it?
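Roughly, the cleaning step looks like this (a simplified sketch, not my exact code):

```python
# Simplified sketch of the cleaning step: drop tokens not found in the nltk words corpus.
import nltk
nltk.download("words", quiet=True)
from nltk.corpus import words

english_vocab = set(w.lower() for w in words.words())

def keep_english(tokens):
    return [t for t in tokens if t.lower() in english_vocab]

# 'facebook' and 'instagram' get dropped because the corpus predates them
print(keep_english(["love", "facebook", "instagram", "python"]))
```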

I prefer nltk corpus but I'm open to other suggestions.

Thanks in advance

Ksofiac
user4550050
  • Probably more relevant than the marked "duplicate": https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python – alexis Jun 15 '17 at 14:15

2 Answers


Rethink your approach. Any collection of English texts will have a "long tail" of words that you have not seen before. No matter how large a dictionary you amass, you'll end up removing words that are perfectly good English. And to what purpose? Leave them in; they won't spoil your classification.

If your goal is to remove non-English text, do it at the sentence or paragraph level using a statistical approach, e.g. ngram models. They work well and need minimal resources.
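For example, here's a rough sketch of sentence-level filtering using the third-party langdetect package (just one of several statistical detectors; swap in whichever you prefer):

```python
# Sketch: keep only sentences that a statistical language detector labels as English.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def keep_english_sentences(sentences):
    kept = []
    for s in sentences:
        try:
            if detect(s) == "en":
                kept.append(s)
        except LangDetectException:
            pass  # too short or no usable features; decide whether to keep or drop these
    return kept

print(keep_english_sentences([
    "I posted this on Facebook yesterday.",
    "Das ist ein deutscher Satz.",
]))
```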

alexis

I'd use Wikipedia, but it's pretty time-consuming to tokenize all of it. Fortunately, that's already been done for you: you could use a Word2Vec model trained on 100 billion words of Wikipedia and simply check whether the word is in the model's vocabulary.
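A rough sketch of that check with gensim (the file name below is only a placeholder for whichever pretrained Word2Vec binary you download):

```python
# Sketch: treat "known word" as "word has an embedding in a pretrained Word2Vec model".
from gensim.models import KeyedVectors

# Placeholder path: point this at the pretrained binary you actually downloaded.
kv = KeyedVectors.load_word2vec_format("pretrained-word2vec.bin", binary=True)

def in_model(word):
    return word in kv.key_to_index  # gensim >= 4; on gensim 3.x use `word in kv.vocab`

print(in_model("facebook"), in_model("qwertyzzz"))
```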

I also found this project, where Chris made text files of the model's 3-million-word vocabulary.

Note that this project's list of words doesn't contain some stop words, so it'd be a good idea to take the union of your nltk list and this one.
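Something like this, assuming the vocabulary was saved as one word per line (the 'vocab.txt' path is just a placeholder):

```python
# Sketch: union the nltk word list with a downloaded one-word-per-line vocabulary file.
import nltk
nltk.download("words", quiet=True)
from nltk.corpus import words

with open("vocab.txt", encoding="utf-8") as f:  # placeholder path
    model_vocab = set(line.strip().lower() for line in f if line.strip())

combined_vocab = set(w.lower() for w in words.words()) | model_vocab

def is_known(word):
    return word.lower() in combined_vocab
```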

aberger