
I am doing a data cleaning exercise in Python, and the text that I am cleaning contains Italian words, which I would like to remove. I have been searching online for whether I can do this in Python using a toolkit like NLTK.

For example, given some text:

"Io andiamo to the beach with my amico."

I would like to be left with:

"to the beach with my" 

Does anyone know how this could be done? Any help would be much appreciated.

Andre Croucher

3 Answers


You can use the words corpus from NLTK:

import nltk
words = set(nltk.corpus.words.words())  # requires a one-time nltk.download('words')

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent)
         if w.lower() in words or not w.isalpha())
# 'Io to the beach with my .'

Unfortunately, Io happens to be an English word. In general, it may be hard to decide whether a word is English or not.
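
A couple of quick membership checks make the corpus's limits concrete (a sketch; the exact entries depend on the standard words corpus shipped with NLTK):

print("io" in words)         # True  -- which is why "Io" passes the w.lower() check
print("company" in words)    # True
print("companies" in words)  # False -- inflected forms are not listed (see comments)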

DYZ
  • Edited to preserve non-words (punctuation, numbers, etc.) – DYZ Dec 22 '16 at 19:22
  • Hi, thank you for your answer, but when I applied it to plural nouns such as resources or boys, they were also removed. Do you know why that happens? – YihanBao Feb 06 '20 at 16:29
  • The words corpus does not contain the plural forms. You have to do lemmatization first. – DYZ Feb 06 '20 at 17:34
  • Add the line: `nltk.download('words')` if you are getting `Resource words not found.`. – hafiz031 Feb 16 '21 at 09:28
  • @DYZ Is there a way to use the `words` corpus on a column of `array`? Please view my questions [question 1](https://stackoverflow.com/questions/66367953/remove-non-english-words-from-column-in-pyspark) and [question 2](https://stackoverflow.com/questions/66430946/remove-meaningless-words-from-pyspark-column) – Samiksha Mar 02 '21 at 11:52
  • Hey, but the NLTK words corpus is not exhaustive: it does not contain all the different forms of a word, synonyms, etc.; it only contains 235,886 unique English words. I checked whether the words company and companies both exist in this set and only found company, not companies. Considering this, is there a way to increase the size of the set with more words, different forms, and synonyms? Or is there another efficient way to go about this? – Mar 30 '21 at 09:53
  • @sachinkimars You can do lemmatization before looking up the words in the corpus; see the sketch after these comments. – DYZ Mar 30 '21 at 13:23
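
Building on that suggestion, here is a minimal sketch of lemmatizing before the lookup, so that plural forms such as boys or beaches survive the filter. It assumes the WordNet lemmatizer and its data (nltk.download('wordnet')) in addition to the words corpus:

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads; uncomment on first run.
# nltk.download('words'); nltk.download('wordnet')

words = set(nltk.corpus.words.words())
lemmatizer = WordNetLemmatizer()

sent = "Io andiamo to the beaches with my amico and the boys."
print(" ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    # Also try the noun lemma, so 'beaches' matches 'beach' in the corpus.
    if w.lower() in words
    or lemmatizer.lemmatize(w.lower()) in words
    or not w.isalpha()
))
# 'Io to the beaches with my and the boys .'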

On macOS this code can still raise an exception, so make sure you download the words corpus manually. Importing the nltk library is not enough, because on macOS it does not download the words corpus automatically; you have to download it yourself, otherwise you will hit the exception.

import nltk 
nltk.download('words')
words = set(nltk.corpus.words.words())

Now you can run the same code as in the previous answer:

sent = "Io andiamo to the beach with my amico."
sent = " ".join(w for w in nltk.wordpunct_tokenize(sent) if w.lower() in words or not w.isalpha())

The NLTK documentation does not mention this step, but I hit it as a GitHub issue and solved it this way, and it works. If you leave out the words download, the error on macOS can keep coming back again and again.
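
If you prefer not to re-download on every run, a common defensive pattern is to fetch the corpus only when it is missing (a sketch; corpora/words is the resource path the words corpus installs under):

import nltk

# Download the corpus only if it is not already installed locally.
try:
    nltk.data.find('corpora/words')
except LookupError:
    nltk.download('words')

words = set(nltk.corpus.words.words())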

Ananda G
  • Hey, but the NLTK words corpus is not exhaustive: it does not contain all the different forms of a word, synonyms, etc.; it only contains 235,886 unique English words. I checked whether the words company and companies both exist in this set and only found company, not companies. Considering this, is there a way to increase the size of the set with more words, different forms, and synonyms? Or is there another efficient way to go about this? – Mar 30 '21 at 09:53
  • This is where stemming comes in. You can use NLTK to take words back to their root word: for example, 'cared', 'caring', and 'careful' are all stemmed down to care. You can check the SnowballStemmer. – Temitope Babatola Sep 27 '21 at 11:51
from nltk.stem.snowball import SnowballStemmer

snow_stemmer = SnowballStemmer(language='english')

# list of words to stem
words = ['cared', 'caring', 'careful']

# stem of each word
stem_words = [snow_stemmer.stem(w) for w in words]

# stemming results
for w1, s1 in zip(words, stem_words):
    print(w1 + ' ----> ' + s1)
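
To connect this back to the question, one option is to stem both the corpus and the tokens before comparing, so inflected forms like beaches still match. This is only a sketch: stems are not always dictionary words (Snowball turns companies into compani), so precomputing stems of the whole corpus trades some accuracy for coverage.

import nltk
from nltk.stem.snowball import SnowballStemmer

# nltk.download('words')  # uncomment on first run
stemmer = SnowballStemmer(language='english')

# Stem every corpus entry once so inflected tokens can match it.
stems = {stemmer.stem(w) for w in nltk.corpus.words.words()}

sent = "Io andiamo to the beaches with my amico."
print(" ".join(
    w for w in nltk.wordpunct_tokenize(sent)
    if stemmer.stem(w.lower()) in stems or not w.isalpha()
))
# 'Io to the beaches with my .'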
Suraj Rao