I have a dataframe where one of the columns is 'text', and I'm trying to clear the whole cell if that cell consists of non-English words.

I removed all the punctuation and all non-ASCII characters from the cell. Now I'm trying to import an English vocabulary, convert the words to lower case, and check whether the words in my cell are in that dictionary. However, I don't get any output because the processing gets stuck.

places = []
with open('english-words/words.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline and normalise case
        currentPlace = line.strip().lower()
        places.append(currentPlace)

def non_eng(texx):
    # keep only the words that appear in the vocabulary
    texx = texx.lower()
    s = texx.split()
    zz = ''
    for i in s:
        if i in places:
            zz += " " + i
    return zz

df['text'] = df['text'].map(non_eng)

Is there a better way to check whether the cell consists of English words and not French, Italian, and so on?

Elizabeth
  • What do you mean when you say English words? Is it necessary for the word to be meaningful, or can it be any random set of characters? – Debdut Goswami Dec 07 '19 at 09:51
  • Does this answer your question? [extract English words from string in python](https://stackoverflow.com/questions/25716221/extract-english-words-from-string-in-python) – Debdut Goswami Dec 07 '19 at 09:53
  • I need exactly the English words with meaning, otherwise ASCII would be enough – Elizabeth Dec 07 '19 at 10:01
  • A couple of links for you: Comparing every string entry to a big dictionary can be very slow. There are packages for detecting languages, check out this one: https://pypi.org/project/langdetect/ Once you have the language detected, you can e.g. drop the row from the dataframe with this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html – Dennis Dec 07 '19 at 10:04
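
One likely reason the dictionary comparison is slow is that `places` is a list, so every `i in places` test scans the whole list. The following is only a sketch of one way around that, assuming the vocabulary is loaded into a set instead. The names `load_vocabulary`, `english_ratio` and `keep_if_english`, and the 0.8 threshold, are hypothetical and not from the question; the tiny in-memory vocabulary is for illustration only.

```python
def load_vocabulary(path):
    """Read one word per line into a set (average O(1) membership tests)."""
    with open(path, "r", encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def english_ratio(text, vocabulary):
    """Fraction of whitespace-separated tokens found in the vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def keep_if_english(text, vocabulary, threshold=0.8):
    """Return the text unchanged if enough tokens are known, else ''."""
    # threshold is an assumed cut-off; tune it for your data
    return text if english_ratio(text, vocabulary) >= threshold else ""


# Tiny in-memory vocabulary for illustration; with the file from the
# question this would be load_vocabulary('english-words/words.txt').
vocabulary = {"the", "cat", "sat", "on", "mat"}
```

With a dataframe this could be applied as `df['text'] = df['text'].map(lambda t: keep_if_english(t, vocabulary))`. Whether a ratio threshold or a strict "all words known" rule fits better depends on how noisy the text is.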

1 Answer

Detect strings with non English characters in Python

Please refer to the question linked above, which covers identifying non-English strings.

def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True

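This function returns a Boolean: True if the string contains only ASCII characters and False otherwise. Note that it checks the character set, not the language, so French or Italian text written without accents will also pass.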
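
If the cell should keep its text when the check passes and be cleared otherwise, the same test can be wrapped so that it returns the string rather than a Boolean. This is a minimal sketch, assuming Python 3.7+ for `str.isascii()`; the name `keep_if_ascii` and the `fallback` parameter are hypothetical.

```python
def keep_if_ascii(s, fallback=""):
    """Return s unchanged when every character is ASCII, else fallback."""
    # str.isascii() (Python 3.7+) gives the same answer as the
    # encode/decode round trip in isEnglish for ordinary strings.
    return s if s.isascii() else fallback
```

Applied to the column, this might look like `df['text'] = df['text'].map(keep_if_ascii)`. It still only filters by character set, so it will not remove ASCII-only text in other languages.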

Hive minD
  • And what if I need to return another output in the same cell, like returning the string back if it contains only English characters? – Elizabeth Dec 07 '19 at 10:00