178

To filter a DataFrame (df) on a single column, if we consider data with males and females we might do:

males = df[df['Gender']=='Male']

Question 1: But what if the data spanned multiple years and I wanted to only see males for 2014?

In other languages I might do something like:

if A = "Male" and if B = "2014" then 

(except I want to do this and get a subset of the original DataFrame in a new dataframe object)

Question 2: How do I do this in a loop, and create a DataFrame object for each unique set of year and gender (i.e. a df for each of: 2013-Male, 2013-Female, 2014-Male, and 2014-Female)?

for y in year:

for g in gender:

df = .....
Henry Ecker
yoshiserry
  • Do you want to *filter* it or *group* it? If you want to create a separate DataFrame for each unique set of year and gender, look at `groupby`. – BrenBarn Feb 28 '14 at 04:35
  • 2
    [This answer](https://stackoverflow.com/a/54358361/4909087) gives a comprehensive overview of boolean indexing and logical operators in pandas. – cs95 Jan 25 '19 at 06:17

10 Answers

278

Use the & operator, and don't forget to wrap each sub-condition in ():

males = df[(df['Gender']=='Male') & (df['Year']==2014)]

To store your DataFrames in a dict using a for loop:

from collections import defaultdict
dic = {}
for g in ['male', 'female']:
    dic[g] = defaultdict(dict)
    for y in [2013, 2014]:
        dic[g][y] = df[(df['Gender']==g) & (df['Year']==y)]  # store the DataFrames in a dict of dicts

A demo for your getDF:

def getDF(dic, gender, year):
    return dic[gender][year]

print(getDF(dic, 'male', 2014))
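As discussed in the comments, a flat dict keyed by (gender, year) tuples is an alternative to the nested dict above. A small self-contained sketch (toy data and the 'Gender'/'Year' column names are assumed):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['male', 'female', 'male'],
                   'Year': [2013, 2014, 2014]})

# One flat dict: each key is a (gender, year) tuple
subsets = {}
for g in ['male', 'female']:
    for y in [2013, 2014]:
        subsets[(g, y)] = df[(df['Gender'] == g) & (df['Year'] == y)]

# look up a subset with a single tuple key
males_2014 = subsets[('male', 2014)]
```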
Henry Ecker
zhangxaochen
  • great answer zhangxaochen - could you edit your answer to show at the bottom how you might do a for loop, which creates the dataframes (with year and gender data) but adds them to a dictionary so they can be accessed later by my getDF method? def GetDF(dict,key): return dict[key] – yoshiserry Feb 28 '14 at 05:11
  • @yoshiserry what's the `key` like in your `getDF`? a single parameter or a tuple of keys? be specific plz ;) – zhangxaochen Feb 28 '14 at 05:21
  • hi it's a single key, just a word, that would correspond to the gender (male, or female) or year (13, 14) Didn't know you could have a tuple of keys. Could you share an example of when and how you would do this? – yoshiserry Feb 28 '14 at 05:24
  • could you have a look at this question too. I feel like you could answer it. Relates to pandas dataframes again. http://stackoverflow.com/questions/22086619/how-to-apply-a-function-to-multiple-columns-in-a-pandas-dataframe-at-one-time – yoshiserry Feb 28 '14 at 05:26
  • is it possible to turn the gender dataframe column into a list to iterate over with a for loop? – yoshiserry Mar 11 '14 at 12:21
  • @yoshiserry what? better to post your question in detail ;) – zhangxaochen Mar 11 '14 at 12:30
  • 3
    Note that the `Gender` and `Year` should both be strings, i.e., `'Gender'` and `'Year'`. – Steven C. Howell May 04 '17 at 19:47
38

Starting from pandas 0.13, this is the most efficient way.

df.query('Gender=="Male" & Year=="2014" ')
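If the filter values live in Python variables, query() can reference them with the @ prefix. A small sketch on toy data (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                   'Year': [2013, 2014, 2014]})

gender, year = 'Male', 2014
# @name pulls the value from the surrounding Python scope
result = df.query('Gender == @gender and Year == @year')
```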
redreamality
37

In case somebody wonders which way of filtering is faster (the accepted answer or the one from @redreamality):

import pandas as pd
import numpy as np

length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)

%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the results depend on the size and the data. On my laptop, query() gets faster after about 500k rows. Further, the string comparison in Year=="2014" has unnecessary overhead (Year==2014 is faster).

Bouncner
29

For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
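As requested in the comments, here is a fleshed-out sketch that also defines f. The data and the condition inside f are made-up examples:

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['Male', 'Female', 'Male'],
                   'col_2': [2013, 2014, 2014]})

def f(x1, x2):
    # any condition on the pair (x1, x2); here: 'Male' in 2014
    return x1 == 'Male' and x2 == 2014

# apply f row-wise to the two columns and keep the rows where it is True
filtered = df[df[['col_1', 'col_2']].apply(lambda x: f(*x), axis=1)]
```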

guibor
    A fleshed out example where you also define f would improve this answer. – user239558 Apr 16 '21 at 11:00
  • Great idea! Use df[df[["col_1", "col_2"]].apply(lambda x: True if tuple(x.values) == ("val_1", "val_2") else False, axis=1)] to filter by a tuple of desired values for specific columns, for example. Or even shorter, df[df[["col_1", "col_2"]].apply(lambda x: tuple(x.values) == ("val_1", "val_2"), axis=1)] – Anatoly Alekseev Jun 28 '22 at 12:21
8

Since you are looking for rows that meet the condition Column_A == 'Value_A' and Column_B == 'Value_B',

you can do this using loc:

df = df.loc[df['Column_A'].eq('Value_A') & df['Column_B'].eq('Value_B')]

You can find the full documentation under pandas DataFrame.loc.
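A quick runnable sketch of the loc/eq approach on toy data:

```python
import pandas as pd

df = pd.DataFrame({'Column_A': ['Value_A', 'Value_A', 'other'],
                   'Column_B': ['Value_B', 'other', 'Value_B']})

# .eq() is the method form of ==; the masks are combined with &
subset = df.loc[df['Column_A'].eq('Value_A') & df['Column_B'].eq('Value_B')]
```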

Kaish kugashia
6

You can create your own filter function using query in pandas. Here the df is filtered by all of the kwargs parameters. Don't forget to add some validation of the kwargs to tailor the filter function to your own df.

def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)
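A usage sketch of the function above (the function is repeated here so the snippet is self-contained; note the values must be strings, since the query wraps them in quotes):

```python
import pandas as pd

def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)

# Year is stored as strings here, matching the quoted query values
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male'],
                   'Year': ['2013', '2014', '2014']})

males_2014 = filter(df, Gender='Male', Year='2014')
```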
Alex
  • Thanks for the elegant solution! I think it's the best out of all the rest. It combines the efficiency of using query with the versatility of having it as a function. – A Merii Jul 27 '20 at 09:37
  • 2
    Note that this assumes the value `kwargs[key]` is a string; it can be made a bit more generic (at least ints and strings) by something like `val = kwargs[key]` and `val_str = f'"{val}"' if isinstance(val, str) else f'{str(val)}` and `query_list.append(f'{key}=={val_str}')` – THK Feb 26 '21 at 19:27
2

You can filter by multiple columns (more than two) by using np.logical_and.reduce in place of chained & (or np.logical_or.reduce in place of |).

Here's an example function that does the job, if you provide target values for multiple fields. You can adapt it for different types of filtering and whatnot:

import numpy as np

def filter_df(df, filter_values):
    """Filter df by matching targets for multiple columns.

    Args:
        df (pd.DataFrame): dataframe
        filter_values (None or dict): Dictionary of the form:
                `{<field>: <target_values_list>}`
            used to filter columns data.
    """
    if filter_values is None or not filter_values:
        return df
    return df[
        np.logical_and.reduce([
            df[column].isin(target_values)
            for column, target_values in filter_values.items()
        ])
    ]

Usage:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 2, 3, 4]})

filter_df(df, {
    'a': [1, 2, 3],
    'b': [1, 2, 4]
})
Tom Bug
1

An improvement to Alex's answer:

def df_filter(df, **kwargs):
    query_list = []
    for key, value in kwargs.items():
        if value is not None:
            query_list.append(f"{key}==@kwargs['{str(key)}']")
    query = ' & '.join(query_list)
    return df.query(query)

This removes None values, so it can be incorporated directly into functions where some arguments default to None. Also, the previous version would not work if a value was not a string; this one works with arguments of any type.

0

After a few years I came back to this question and can propose another solution; it's especially good when you have lots of filters involved. We can create several filtering masks and then operate on those masks:

>>> df = pd.DataFrame({'gender': ['Male', 'Female', 'Male'],
...                    'married': [True, False, False]})
>>> gender_mask = df['gender'] == 'Male'
>>> married_mask = df['married']
>>> filtered_df = df.loc[gender_mask & married_mask]
>>> filtered_df
  gender  married
0   Male     True

Maybe it's not the shortest solution, but it's readable and helps organize the code.

Alex
0

My dataframe has 25 columns, and I want to keep the freedom to choose any kind of filter later (any number of parameters and conditions). I use this:

def flex_query(params):
    res = load_dataframe()
    if not isinstance(params, list):
        return None
    for el in params:
        res = res.query(f"{el[0]} {el[1]} {el[2]}")
    return res

And calling this:

res = flex_query([['DATE','==', '"2022-09-26"'],['LEVEL','>=',2], ['PERCENT','>',10.2]])

where 'DATE', 'LEVEL', 'PERCENT' are column names. As you can see, this is a very flexible query method that takes any number of params and different types of conditions. It lets me compare ints, floats, and strings, all in one.
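A self-contained sketch of the same idea, with a toy DataFrame passed in instead of the undefined load_dataframe() (column names taken from the example above):

```python
import pandas as pd

def flex_query(df, params):
    # apply each [column, operator, value] triple as a chained query
    res = df
    if not isinstance(params, list):
        return None
    for col, op, value in params:
        res = res.query(f"{col} {op} {value}")
    return res

df = pd.DataFrame({'DATE': ['2022-09-26', '2022-09-27'],
                   'LEVEL': [2, 1],
                   'PERCENT': [10.5, 20.0]})

res = flex_query(df, [['DATE', '==', '"2022-09-26"'],
                      ['LEVEL', '>=', 2],
                      ['PERCENT', '>', 10.2]])
```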

DenisSh