
I turned this Twitter corpus into a pandas DataFrame and I am trying to find the non-English tweets and delete them from the DataFrame, so I did this:

for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # Comparing if word is non-English
        df.drop(j)

print(df.shape)

But when I check the shape, no rows have been dropped. Am I using the drop function wrong, or do I need to keep track of the index of the row?

Kailin Huang

2 Answers


That's because df.drop() returns a new DataFrame instead of modifying your original one. Try setting inplace=True:

for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # Comparing if word is non-English
        df.drop(j, inplace=True)

print(df.shape)
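
As a rough illustration of the point above, here is a minimal sketch on toy data. The vocabulary set and the `is_english` helper are hypothetical stand-ins for the WordNet lookup, not part of the original code. It shows that a plain drop() leaves the original untouched, and that a boolean mask can do the same filtering without a loop:

```python
import pandas as pd

# Hypothetical toy data; the column name "i" follows the question.
df = pd.DataFrame({"i": ["hello", "bonjour", "world", "hola"]})

# Assumption: a tiny hard-coded vocabulary standing in for wordnet.synsets.
ENGLISH = {"hello", "world"}

def is_english(word):
    """Hypothetical predicate: True if the word is in the toy vocabulary."""
    return word in ENGLISH

# Without inplace=True, drop() returns a new DataFrame; df itself is unchanged.
dropped = df.drop(1)
shape_after_plain_drop = df.shape

# Alternative: build a boolean mask once and keep only the matching rows.
mask = df["i"].apply(is_english)
filtered = df[mask]
```

Reassigning the result (`df = df.drop(j)`) would also work in place of inplace=True; the mask version simply avoids mutating the frame while iterating over it.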
Jianxun Li
  • Thank you, that makes perfect sense. – Kailin Huang Aug 07 '15 at 15:38
  • Traceback (most recent call last):
      File "/Users/kailinh/PycharmProjects/reverseGeoencoder/filter.py", line 86, in <module>
        if not wordnet.synsets(df.i[j]):#Comparing if word is non-English
      File "/Library/Python/2.7/site-packages/nltk/corpus/reader/wordnet.py", line 1406, in synsets
        lemma = lemma.lower()
    AttributeError: 'float' object has no attribute 'lower'
    Does that mean I need to modify the text for synsets to work? – Kailin Huang Aug 07 '15 at 15:40
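
The 'float' in that AttributeError is most likely a missing value (NaN) in the column, since pandas stores missing text as a float. A minimal sketch of one way to guard against it, assuming Python 3; `fake_synsets` and `safe_lookup` are hypothetical names, with `fake_synsets` standing in for wordnet.synsets:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value, as in a real tweet column.
df = pd.DataFrame({"i": ["hello", np.nan, "world", "bonjour"]})

def fake_synsets(word):
    """Assumption: stand-in for wordnet.synsets; empty list means no match."""
    return ["synset"] if word.lower() in {"hello", "world"} else []

def safe_lookup(value, lookup):
    """Return False for non-string cells instead of passing them to lookup."""
    if not isinstance(value, str):
        return False
    return bool(lookup(value))

# Rows with NaN or no match are excluded from the result.
mask = df["i"].apply(lambda v: safe_lookup(v, fake_synsets))
cleaned = df[mask]
```

Dropping the missing rows up front with `df.dropna(subset=["i"])` would be another option if those rows are not needed at all.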

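This keeps only the rows whose text contains at least one entry from NLTK's English word list, and so filters out the rows in the pandas DataFrame that have no English match.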

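import nltk
nltk.download('words')  # fetch the English word list on first use
from nltk.corpus import words
import pandas as pd

data1 = pd.read_csv("testdata.csv")

# Unique entries from the NLTK word list
Word = list(set(words.words()))

# Keep rows matching at least one entry; na=False treats missing values as non-matches
df_final = data1[data1['column_name'].str.contains('|'.join(Word), na=False)]

print(df_final)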
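
Note that str.contains with a joined pattern matches substrings, so a short entry such as "a" can match almost any text. A stricter variant is to compare whole tokens against the vocabulary. The sketch below assumes whitespace tokenization and a 0.5 threshold, both arbitrary choices; `vocab` is a hypothetical miniature stand-in for set(words.words()), and `english_ratio` is a hypothetical helper:

```python
import pandas as pd

# Assumption: tiny vocabulary standing in for set(words.words()).
vocab = {"the", "cat", "sat", "on", "a", "mat"}

# Hypothetical toy data, including a missing value.
data = pd.DataFrame({"column_name": ["the cat sat", "le chat noir", None]})

def english_ratio(text, vocab):
    """Fraction of whitespace-separated tokens found in vocab (0.0 for non-strings)."""
    if not isinstance(text, str):
        return 0.0
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in vocab for t in tokens) / len(tokens)

ratios = data["column_name"].apply(lambda t: english_ratio(t, vocab))

# Keep rows where at least half of the tokens are known words.
df_final = data[ratios >= 0.5]
```

Real tweets would need extra cleanup (punctuation, hashtags, URLs) before the token lookup is meaningful.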

Ghost