1

I have a dataframe with mixed types

import pandas as pd
df = pd.DataFrame({'gender':list('MMMFFF'),
                   'height':[4,5,4,5,5,4],
                   'age':[70,80,90,40,2,3]})

print(df)

enter image description here

I need to be able to take pairs of columns and filter the dataframe on those pairs of columns. So, I have seen loc and query and it looks like I need to query but programmatically.

This link got me some of the way there, but to really be able to do this, I need to be able to programatically know and change the types in the query. Look at how they do it with known types:

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)

I don't have the luxury to know what the types will be when my pairs of columns come in. Is there a way to make this more flexible and introduce testing for types and then implementing the right query for the right types? By this, I mean that I need to add quotes but only if my values are strings and I need to do this by testing the type.

Edit: It looks like I solved this the following way:

yQuery = ' & '.join(['{}=={}'.format(self.query_type_setter(k,True),self.query_type_setter(v, False)) for k, v in yVal])


def query_type_setter(self, value, isColumn):
    """
        sets the query value depending on whether or not it is a string
    :param value:
    :return:
    """
    if isColumn:
        return "`" + value + "`"
    if isinstance(value, str):
        return "'" + value + "'"
    else:
        return value

I added in a function in the right spot to detect for strings.

mathlete42
  • 111
  • 4
  • 1
    Perhaps I'm not understanding the example correctly, but what would be the right query for the right type? How would the example change if the types were not known? – jqurious May 31 '23 at 01:13

2 Answers2

1

Lets try to create a more flexible query with df.query() by determining the data type of the column when you are constructing the query string.

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

query = ' & '.join(f'{i} {j} {repr(k) if df[i].dtype == "object" else k}' for i, j, k in zip(column, equal, condition))

filtered_df = df.query(query)
Saxtheowl
  • 4,136
  • 5
  • 23
  • 32
0

Instead of using .query and strings, perhaps another option is:

comparison = [ ('height', 5), ('gender', 'F'), ('age', 40) ]
np.column_stack([df[column] == value for column, value in comparison])
array([[False, False, False],
       [ True, False, False],
       [False, False, False],
       [ True,  True,  True],
       [ True,  True, False],
       [False,  True, False]])

You can then .all(axis=1) to get a row-based filter:

cols = np.column_stack([
   df[column] == value for column, value in comparison
])
   
df[cols.all(axis=1)]
  gender  height  age
3      F       5   40

If you want to use other comparison methods, you can map the string value to the corresponding method e.g. == -> df.eq

ops = { '==': 'eq', '>': 'gt' }

column = 'height'
value = 4

getattr(df[column], ops['>'])(value)
0    False
1     True
2    False
3     True
4     True
5    False
Name: height, dtype: bool
jqurious
  • 9,953
  • 1
  • 4
  • 14