I'm trying to remove elements from lists (each list is stored in a row of a pandas DataFrame) with .remove(). The basic idea is that I iterate through all the rows in the DataFrame, then through every element in the row (a list), and check whether that particular element is a keeper or a "goner":
    import pandas as pd
    from tqdm import tqdm

    data = pd.read_csv('raw_output_v2.csv', names=['ID', 'Body'])
    data['Body'] = data['Body'].apply(eval)  # each cell holds a stringified list

    # count how often each word occurs across all rows
    keyword_dict = {}
    for row in tqdm(data['Body'], desc="building dict"):
        for word in row:
            if word in keyword_dict:
                keyword_dict[word] += 1
            else:
                keyword_dict[word] = 1

    new_df = remove_sparse_words_from_df(data, keyword_dict, cutoff=1_000_000)
And here is the important stuff:
    def remove_sparse_words_from_df(df, term_freq, cutoff=1):
        for row in tqdm(df['Body'], desc="cleaning df"):
            for word in row:
                if term_freq[word] <= cutoff:
                    row.remove(word)
        return df
I've uploaded a short example CSV to use here: https://pastebin.com/g25bHCC7.
My problem is: the remove_sparse_words_from_df function removes some occurrences of the words that fall below the cutoff, but not all of them. For example, the word "clean" occurs ~10k times in the original DataFrame (data), yet after running remove_sparse_words_from_df, about 2k occurrences still remain. The same happens with other words.
What am I missing?
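For what it's worth, the partial removal does not seem to depend on pandas at all; here is my guess at a minimal standalone reproduction of the same symptom, just mutating a plain list with .remove() while iterating over that same list (the toy word list is made up):

```python
# Calling .remove() while iterating shifts the remaining elements left,
# so the iterator skips the element that slides into the freed slot.
words = ["clean", "clean", "keep", "clean", "clean"]
for word in words:
    if word == "clean":
        words.remove(word)
print(words)  # → ['keep', 'clean', 'clean']: two "clean" entries survive
```

The surviving "clean" entries here look a lot like the ~2k leftovers I'm seeing in the DataFrame.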