I'm reading a .csv file with pandas, and I want to keep only the rows where a specified column's value is in a dictionary (that is, filter out the rows where it is not). Something like this:

import pandas as pd

df = pd.read_csv('mycsv.csv', sep='\t', encoding='utf-8', index_col=0,
    names=['col1', 'col2', 'col3', 'col4'])

# Relative frequencies of the 20 most common col4 values
c = df.col4.value_counts(normalize=True).head(20)
# Take every second entry (positions 1, 3, 5, ...) and map value -> frequency
values = dict(zip(c.index.tolist()[1::2], c.tolist()[1::2]))

# Goal (pseudocode): df_filtered = rows of df where col4 is in values

After searching around a bit I tried using the following to filter it:

df_filtered = df[df.col4 in values]

but that unfortunately didn't work.

I've done the following to make it work for what I want, but it's incredibly slow for a large .csv file, so I thought there must be a way to do it that's built into pandas:

t = [(list(df.col1) + list(df.col2) + list(df.col3)) for i in range(len(df.col4)) if list(df.col4)[i] in values]
user5368737

2 Answers

If you want to check against the dictionary values:

df_filtered = df[df.col4.isin(values.values())]

If you want to check against the dictionary keys:

df_filtered = df[df.col4.isin(values.keys())]
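
To make the two variants concrete, here is a minimal sketch on a small in-memory frame. The data and the contents of `values` are made up for illustration and only stand in for the CSV from the question. The keys are wrapped in `list(...)` to be safe across pandas versions, and `~` inverts the mask if you want the rows that were dropped instead.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data in the question
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col4": ["a", "b", "c", "a"],
})

# Hypothetical lookup dict: keys are col4 labels, values are frequencies
values = {"a": 0.5, "c": 0.25}

# Boolean mask: True where col4 is one of the dict keys
mask = df.col4.isin(list(values.keys()))

kept = df[mask]       # rows whose col4 is in the dict
dropped = df[~mask]   # rows whose col4 is not in the dict

print(kept)
print(dropped)
```

`isin` builds the whole mask in one vectorized call, which is why it should be much faster than looping over the column in Python.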
Kyrubas
A.Kot

As A.Kot mentioned, you can use the dict's values() method for the search. Note that values() returns a list in Python 2 and a view object in Python 3, and a plain `in` test against either one is a linear scan.
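
A quick way to see what `values()` gives you on Python 3 (the dict contents here are hypothetical):

```python
values = {"a": 0.5, "c": 0.25}  # hypothetical value -> frequency mapping

view = values.values()
print(type(view).__name__)  # dict_values on Python 3

# A membership test on the view walks the entries one by one
found = 0.25 in view

# Materialize the view when a plain list is needed
as_list = list(view)
```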

If your only reason for creating that dict is membership testing, and you only ever look at the values of the dict, then you are using the wrong data structure.

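A set is the natural structure for membership testing and gives fast lookups. Python's `in` operator still cannot be applied element-wise to a Series, though, so the check goes through isin:

df_filtered = df[df.col4.isin(values)]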
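
A minimal sketch of the set-based filter, assuming a made-up frame and a hypothetical set named `wanted`. It goes through `Series.isin`, because `df.col4 in wanted` asks whether the Series object itself is a member of the set, which fails since a Series is not hashable.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data in the question
df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col4": ["a", "b", "c", "a"],
})

wanted = {"a", "c"}  # hypothetical set used in place of the dict

# The plain `in` operator tries to hash the whole Series and raises TypeError
try:
    df[df.col4 in wanted]
    in_operator_worked = True
except TypeError:
    in_operator_worked = False

# Element-wise membership test, then boolean indexing
df_filtered = df[df.col4.isin(wanted)]
print(df_filtered)
```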

If you use values elsewhere, and you want to check against the keys, then you're ok because membership testing against keys is efficient.