
I'm trying to remove elements from lists stored in a pandas DataFrame using .remove(). The basic idea is that I iterate through all the rows in the DataFrame, then through every element in each row (each row is a list), and check whether that particular element is a keeper or a "goner".

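import pandas as pd
from tqdm import tqdm

data=pd.read_csv('raw_output_v2.csv', names=['ID','Body'])
# parse each Body string into a Python object (a list of words)
data['Body']=data['Body'].apply(eval)
# count how often each word occurs across all rows
keyword_dict={}
for row in tqdm(data['Body'], desc="building dict"):
    for word in row:
        if word in keyword_dict:
            keyword_dict[word]+=1
        else:
            keyword_dict[word]=1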

new_df=remove_sparse_words_from_df(data, keyword_dict, cutoff=1_000_000)

And here is the important stuff:

def remove_sparse_words_from_df(df, term_freq, cutoff=1):
    i=0
    for row in tqdm(df['Body'],desc="cleaning df"):
        for word in row:
            if term_freq[word]<=cutoff:
                row.remove(word)
            else:
                continue
    return df

I've uploaded a short example csv to be used here: https://pastebin.com/g25bHCC7.

My problem is: the remove_sparse_words_from_df function removes some occurrences of the words that fall below the cutoff, but not all. Example: the word "clean" occurs ~10k times in the original dataframe (data), but after running remove_sparse_words_from_df about 2k still remain. The same happens with other words.

What am I missing?

Jozsef

1 Answer


You're modifying your list (row.remove) while iterating over it (for word in row:). You can see here, here and here why this may be a problem:

Modifying a sequence while iterating over it can cause undesired behavior due to the way the iterator is built. To avoid this problem, a simple solution is to iterate over a copy of the list... using the slice notation with default values list_1[:]
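To make the skipping concrete, here is a small standalone sketch. The function names and the sample list are made up for illustration and are not part of your code. The list iterator walks by index, so each removal shifts the remaining items left and the next item is never visited.

```python
def remove_in_place_buggy(words, unwanted):
    # Iterates over the same list it mutates: after each removal the
    # remaining items shift left, so the iterator skips the next one.
    for word in words:
        if word in unwanted:
            words.remove(word)
    return words


def remove_in_place_copy(words, unwanted):
    # Iterates over a shallow copy (words[:]), so removals from the
    # original list do not disturb the iteration.
    for word in words[:]:
        if word in unwanted:
            words.remove(word)
    return words


sample = ["clean", "clean", "dirty", "clean", "clean"]
buggy = remove_in_place_buggy(list(sample), {"clean"})
fixed = remove_in_place_copy(list(sample), {"clean"})
print(buggy)  # ['dirty', 'clean', 'clean']
print(fixed)  # ['dirty']
```

This matches what you observed: some occurrences are removed, but the ones that get skipped survive.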

    ...
    for row in tqdm(df['Body'],desc="cleaning df"):
        for word in row[:]:
            if term_freq[word]<=cutoff:
                row.remove(word)
    ...

Output with the cutoff set to 1_000_000:

                   ID Body
0  (1483785165, 2009)   []
1  (1538280431, 2010)   []
2  (1795044103, 2010)   []
...
...
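As an alternative to removing in place, you could build a new list per row with a list comprehension, which avoids mutating anything during iteration and also avoids the repeated scans that .remove() does. The following is only a sketch on plain lists: keep_frequent_words and the sample rows are hypothetical names and data, and collections.Counter stands in for your hand-built keyword_dict.

```python
from collections import Counter


def keep_frequent_words(row, term_freq, cutoff=1):
    # Return a new list with only the words whose count exceeds the
    # cutoff; the input row is left untouched.
    return [word for word in row if term_freq[word] > cutoff]


# Hypothetical stand-in for the lists in the 'Body' column.
rows = [["clean", "rare", "clean"], ["clean", "odd"], ["rare", "clean"]]

# Count every word across all rows (same idea as keyword_dict).
term_freq = Counter(word for row in rows for word in row)

cleaned = [keep_frequent_words(row, term_freq, cutoff=2) for row in rows]
print(cleaned)  # [['clean', 'clean'], ['clean'], ['clean']]
```

On the DataFrame, the same function could presumably be applied with something like df['Body'] = df['Body'].apply(keep_frequent_words, args=(term_freq, cutoff)), though I have not run that against your CSV.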
n1colas.m