Dropping rows containing non-English words in a pandas DataFrame

Question

I turned this Twitter corpus into a pandas DataFrame and tried to find the non-English tweets and delete them from the DataFrame, so I did this:

for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j)

print(df.shape)

But when I checked the shape, no rows had been dropped. Am I using the drop function incorrectly, or do I need to keep track of the row index?

score 1 · Accepted Answer · answered Aug 06 '15 at 21:38

That's because df.drop() returns a new DataFrame instead of modifying the original one. Try setting inplace=True:

for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j, inplace=True)

print(df.shape)
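
Dropping rows one at a time inside a loop works, but a boolean mask is usually a more idiomatic way to express the same filter in pandas. The following is a minimal sketch, not the answerer's code: the column name `i` comes from the question, while the tiny `ENGLISH_WORDS` set and the `is_english` helper are hypothetical stand-ins for the `wordnet.synsets` lookup, so the example runs without downloading any NLTK data.

```python
import pandas as pd

# Hypothetical stand-in for the WordNet lookup used in the question.
ENGLISH_WORDS = {"hello", "world", "coffee"}

def is_english(word):
    """Return True if `word` is a string found in the stand-in vocabulary."""
    return isinstance(word, str) and word.lower() in ENGLISH_WORDS

# Made-up sample data; the column is named `i` as in the question.
df = pd.DataFrame({"i": ["hello", "bonjour", "coffee", "hola"]})

# Build the boolean mask once, then keep only the rows where it is True.
mask = df["i"].apply(is_english)
df_english = df[mask].reset_index(drop=True)

print(df_english.shape)
```

Because the mask is computed for the whole column at once, there is no need to hard-code a row count such as 150, and the original `df` is left unchanged.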

Jianxun Li

Thank you, that makes perfect sense. – Kailin Huang Aug 07 '15 at 15:38
Traceback (most recent call last):
  File "/Users/kailinh/PycharmProjects/reverseGeoencoder/filter.py", line 86, in <module>
    if not wordnet.synsets(df.i[j]):#Comparing if word is non-English
  File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1406, in synsets
    lemma = lemma.lower()
AttributeError: 'float' object has no attribute 'lower'
Does that mean I need to modify the text for synsets to work? – Kailin Huang Aug 07 '15 at 15:40
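
The AttributeError in the comment above indicates that at least one value in the column is a float rather than a string. In pandas this is most often NaN standing in for a missing value, though that is an assumption about this particular dataset. One way to guard against it, sketched here with made-up sample data, is to drop the missing entries and cast the rest to strings before the lookup.

```python
import numpy as np
import pandas as pd

# Made-up sample: the NaN mimics a missing tweet, which pandas stores as a float.
df = pd.DataFrame({"i": ["hello", np.nan, "world"]})

# The missing value is a float, so calling .lower() on it would fail.
print(type(df["i"][1]))

# Drop rows whose text is missing, then make sure every remaining value is a str.
clean = df.dropna(subset=["i"]).copy()
clean["i"] = clean["i"].astype(str)

print(clean["i"].tolist())
```

After this cleanup, every value passed to a function like `wordnet.synsets` would be a string.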

score 0 · Answer 2 · answered Sep 17 '20 at 05:09

This keeps only the rows whose text contains at least one entry from the NLTK English word list as a substring, and drops the rest.

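import nltk
import pandas as pd
from nltk.corpus import words

nltk.download('words')  # fetch the English word list on first use

data1 = pd.read_csv("testdata.csv")

# De-duplicated list of English words from the NLTK corpus
word_list = list(set(words.words()))

# Keep rows matching at least one word; missing values count as non-matches
df_final = data1[data1['column_name'].str.contains('|'.join(word_list), na=False)]

print(df_final)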
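
Note that `str.contains` with a joined pattern matches substrings, so a row is kept whenever any listed word appears anywhere inside the text, even within a longer non-English token. A stricter variant, sketched here under assumptions, splits each text into tokens and keeps rows where a chosen fraction of the tokens appears in the vocabulary. The small `VOCAB` set, the `english_ratio` helper, the 0.5 threshold, and the sample data are all hypothetical; in practice `VOCAB` might be built from `words.words()`.

```python
import re
import pandas as pd

# Hypothetical stand-in vocabulary; a real one might come from nltk.corpus.words.
VOCAB = {"i", "love", "this", "coffee", "the", "is", "good"}

def english_ratio(text):
    """Return the fraction of alphabetic tokens in `text` found in VOCAB."""
    if not isinstance(text, str):
        return 0.0  # treat missing or non-string values as non-English
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in VOCAB for t in tokens) / len(tokens)

# Made-up sample data using the column name from the answer.
data1 = pd.DataFrame({"column_name": [
    "I love this coffee",
    "me encanta este cafe",
    "the coffee is good",
    None,
]})

# Keep rows where at least half of the tokens are known English words.
ratios = data1["column_name"].apply(english_ratio)
df_final = data1[ratios >= 0.5]

print(df_final)
```

The threshold is a tuning choice: a higher value is stricter about mixed-language text, while a lower one tolerates names, hashtags, and slang that a word list may not contain.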
