Programatically query a dataframe with mixed types

Question

I have a dataframe with mixed types

import pandas as pd
df = pd.DataFrame({'gender':list('MMMFFF'),
                   'height':[4,5,4,5,5,4],
                   'age':[70,80,90,40,2,3]})

print(df)

I need to be able to take pairs of columns and filter the dataframe on those pairs of columns. So, I have seen loc and query and it looks like I need to query but programmatically.

This link got me some of the way there, but to really be able to do this, I need to be able to programatically know and change the types in the query. Look at how they do it with known types:

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)

I don't have the luxury to know what the types will be when my pairs of columns come in. Is there a way to make this more flexible and introduce testing for types and then implementing the right query for the right types? By this, I mean that I need to add quotes but only if my values are strings and I need to do this by testing the type.

Edit: It looks like I solved this the following way:

yQuery = ' & '.join(['{}=={}'.format(self.query_type_setter(k,True),self.query_type_setter(v, False)) for k, v in yVal])


def query_type_setter(self, value, isColumn):
    """
        sets the query value depending on whether or not it is a string
    :param value:
    :return:
    """
    if isColumn:
        return "`" + value + "`"
    if isinstance(value, str):
        return "'" + value + "'"
    else:
        return value

I added in a function in the right spot to detect for strings.

Perhaps I'm not understanding the example correctly, but what would be the right query for the right type? How would the example change if the types were not known? — jqurious, May 31 '23 at 01:13

score 1 · Answer 1 · answered May 31 '23 at 01:13

1

Lets try to create a more flexible query with df.query() by determining the data type of the column when you are constructing the query string.

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

query = ' & '.join(f'{i} {j} {repr(k) if df[i].dtype == "object" else k}' for i, j, k in zip(column, equal, condition))

filtered_df = df.query(query)

answered May 31 '23 at 01:13

Saxtheowl

4,136
5
23
32

1

This produces the *exact same result* as the OP's. What exactly is the advantage here? – juanpa.arrivillaga May 31 '23 at 01:17

score 0 · Answer 2 · answered May 31 '23 at 14:13

Instead of using .query and strings, perhaps another option is:

comparison = [ ('height', 5), ('gender', 'F'), ('age', 40) ]
np.column_stack([df[column] == value for column, value in comparison])

array([[False, False, False],
       [ True, False, False],
       [False, False, False],
       [ True,  True,  True],
       [ True,  True, False],
       [False,  True, False]])

You can then .all(axis=1) to get a row-based filter:

cols = np.column_stack([
   df[column] == value for column, value in comparison
])
   
df[cols.all(axis=1)]

  gender  height  age
3      F       5   40

If you want to use other comparison methods, you can map the string value to the corresponding method e.g. == -> df.eq

ops = { '==': 'eq', '>': 'gt' }

column = 'height'
value = 4

getattr(df[column], ops['>'])(value)

0    False
1     True
2    False
3     True
4     True
5    False
Name: height, dtype: bool

Programatically query a dataframe with mixed types

2 Answers2