I have a dataframe where one of the columns is 'text', and I'm trying to clean each cell of any non-English words.
I have already removed all punctuation and all non-ASCII characters from the cells. Now I'm trying to load an English word list, lowercase the words, and keep only the words in each cell that appear in that list. However, I never get any output: the processing just seems to hang.
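For reference, the punctuation / non-ASCII removal was along these lines (a simplified sketch, not my exact code):

import string

def strip_punct_and_non_ascii(text):
    # keep only ASCII characters
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # drop punctuation
    return text.translate(str.maketrans('', '', string.punctuation))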
places = []
with open('english-words/words.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing newline and lowercase each dictionary word
        currentPlace = line[:-1].lower()
        places.append(currentPlace)

def non_eng(texx):
    # keep only the words that appear in the English word list
    texx = texx.lower()
    s = texx.split()
    zz = ''
    for i in s:
        if i in places:
            zz += " " + i
    return zz

df['text'] = df['text'].map(non_eng)
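For context, this is the kind of result I'm expecting from non_eng (made-up examples; assuming 'hello', 'world', 'good' and 'morning' are all in words.txt):

non_eng('Hello world bonjour')        # -> ' hello world'
non_eng('buongiorno good morning')    # -> ' good morning'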
Is there a better way to check whether a cell consists of English words rather than French, Italian, and so on?