I think you're over-engineering your solution, so I've opted for a more detailed explanation of the answer.
One way to filter a dataframe is to subscript it with a list or array of booleans. If the array has the same length as the dataframe, the result is a new dataframe containing only the rows aligned with the True values.
Here is an example:
import pandas as pd
df = pd.DataFrame({
'numbers': [0,1,2,3,4],
'letters': ['a','b','c','d','e'],
'colors': ['red', 'blue', 'yellow', 'green', 'purple']
})
df
Which outputs:
|   | numbers | letters | colors |
|---|---------|---------|--------|
| 0 | 0 | a | red |
| 1 | 1 | b | blue |
| 2 | 2 | c | yellow |
| 3 | 3 | d | green |
| 4 | 4 | e | purple |
This is what I mean by subscripting with a boolean list (the usual pandas term is boolean indexing, and the list is often called a mask):
boolean_list = [True, True, False, True, False]
filtered_df = df[boolean_list]
filtered_df
Which outputs:
|   | numbers | letters | colors |
|---|---------|---------|--------|
| 0 | 0 | a | red |
| 1 | 1 | b | blue |
| 3 | 3 | d | green |
We can use a simple comparison to produce this boolean list from a dataframe column (the result is a boolean Series rather than a plain list):
df['numbers']>2
Outputs:
0 False
1 False
2 False
3 True
4 True
Name: numbers, dtype: bool
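To make the link between the comparison and the hand-written boolean list concrete, here is a small sketch. It rebuilds the example dataframe from above; the name `mask` is only for illustration and isn't part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple'],
})

# The comparison yields a boolean Series with one value per row.
mask = df['numbers'] > 2  # 'mask' is an illustrative name

# Converted to a plain list, it looks just like the hand-written boolean_list.
print(mask.tolist())  # [False, False, False, True, True]

# Subscripting with the Series or with the plain list selects the same rows.
same = df[mask].equals(df[mask.tolist()])
print(same)
```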
We can streamline the filtering with this redundant-looking piece of code:
df[df['numbers']>2]
Which outputs:
|   | numbers | letters | colors |
|---|---------|---------|--------|
| 3 | 3 | d | green |
| 4 | 4 | e | purple |
While it looks redundant, all we've done there is subscript df with a list of booleans. As written, this does not change df at all; for that we would need to assign the result back: df = df[filter_argument]
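A quick way to convince yourself of this is to compare row counts before and after reassigning. This is only a sketch on the example dataframe; `filtered` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple'],
})

# Filtering returns a new dataframe; df itself still has all five rows.
filtered = df[df['numbers'] > 2]
rows_before = len(df)
print(rows_before, len(filtered))

# Assigning the result back is what actually replaces df.
df = df[df['numbers'] > 2]
print(len(df))
```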
For more complicated filtering we can use .apply() to get our list of booleans. Say we only want rows where the letter in 'letters' is present in the color in 'colors':
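def letter_in_color(row):
    # True when this row's letter appears in this row's color name
    return row['letters'] in row['colors']
# axis=1 passes each row (rather than each column) to the function
boolean_array = df.apply(letter_in_color, axis=1)
print(boolean_array)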
0 False
1 True
2 False
3 False
4 True
dtype: bool
letter_in_color_df = df[boolean_array]
letter_in_color_df
|   | numbers | letters | colors |
|---|---------|---------|--------|
| 1 | 1 | b | blue |
| 4 | 4 | e | purple |
I wrote this long explanation because, while the concept of filtering a df with a boolean array is quite simple, code that does it often looks weird or redundant, and it isn't clear what is really going on.
I hope you didn't stop reading there, because there is an important and powerful tool you can add to the above situations to preclude many errors and unexpected behavior: .loc[]. This is a more explicit and powerful indexer, and in all of the above cases we can gain its benefits with very few changes:
df[boolean_array] becomes df.loc[boolean_array]
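As a rough sketch of what that buys you, here is the same kind of filter written with .loc on the example dataframe. The `mask` name and the 'unknown' replacement value are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'numbers': [0, 1, 2, 3, 4],
    'letters': ['a', 'b', 'c', 'd', 'e'],
    'colors': ['red', 'blue', 'yellow', 'green', 'purple'],
})

mask = df['numbers'] > 2  # illustrative name for the boolean Series

# Selects the same rows as df[mask].
rows = df.loc[mask]

# .loc also accepts a column selector alongside the row mask.
letters_only = df.loc[mask, 'letters']

# Assigning through .loc updates df itself in a single indexing step.
df.loc[mask, 'colors'] = 'unknown'  # 'unknown' is a made-up value
print(df)
```

Doing the row and column selection in one .loc call, rather than chaining df[mask]['colors'] = ..., is one of the ways it helps avoid unexpected behavior when assigning.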
For more information about df.loc[] instead of df[] see this answer