I have a pandas dataframe where each column has only a few possible values (e.g. column animal can only be 'cat', 'dog', 'mouse' and nothing else). I need to write a function loop(df, col)
where df
is the dataframe and col
is the name of the column which is special compared to all other columns. The special part is that I want to iterate over all columns, and then finally iterate over the last column. The algorithm I have come up with is as follows:
# Use pandas query to get all rows for which the columns exactly match the dictionary
def pd_query(df, queryDict):
# Enable querying non-string datatypes like integers
strwrap = lambda val: '"' + val + '"' if isinstance(val, str) else str(val)
query = ' and '.join([colname + '==' + strwrap(val) for colname, val in queryDict.items()])
return df.query(query)
def f(df, col):
# 1. Explicitly exclude special column from the data frame and remove duplicates
df2 = df.drop(col).drop_duplicates()
# 2. Loop over rows of new dataframe - effectively looping over all columns except that column
for idx2, row2 in df2.iterrows():
do_some_preliminary_stuff(row2)
# 3. Query this row to the original dictionary, thus getting rows which only differ in the special column
df3 = pd_query(df, dict(row2))
# 4. Loop over special column
for idx3, row3 in df3.iterrows():
do_more_stuff(row3)
My questions are:
- Is there a faster/less clumsy algorithm to do this? It feels wasteful making repeated queries to the larger dataframe
- Is this the intended way to use pandas query function?
- Can one query a row directly without having to convert it into a fancy string?
- Currently this comfortably handles strings, bools and ints, but can't handle tuples. Can this be improved to handle tuples?
Edit: Based on suggestion by @QuangHoang, the following seems to work well
import pandas as pd
df = pd.DataFrame({
'Animal' : ['Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Mouse', 'Mouse'],
'isSmart' : [True, False, True, True, False, False, True],
'Color' : ['W', 'B', 'W', 'B', 'W', 'B', 'W']
})
def f(df, col):
colsEff = list(set(df.columns) - set([col]))
for col_vals, data in df.groupby(colsEff):
for idx3, row3 in data.iterrows():
print(list(row3.values))
If we call f(df, 'Animal')
we get the following, meaning that the loop over animals is the inner loop as expected.
['Cat', False, 'B']
['Mouse', False, 'B']
['Dog', False, 'W']
['Dog', True, 'B']
['Cat', True, 'W']
['Dog', True, 'W']
If instead we try f(df, 'isSmart')
the inner loop is over smartness
['Cat', False, 'B']
['Cat', True, 'W']
['Dog', True, 'B']
['Dog', True, 'W']
['Dog', False, 'W']
['Mouse', False, 'B']
['Mouse', True, 'W']
So this part of the question is resolved. I still welcome some suggestions on how to stabilize the pd_query
function to work with tuples, as I might need it later