0

I have a pandas dataframe where each column has only a few possible values (e.g. column animal can only be 'cat', 'dog', 'mouse' and nothing else). I need to write a function loop(df, col) where df is the dataframe and col is the name of the column which is special compared to all other columns. The special part is that I want to iterate over all columns, and then finally iterate over the last column. The algorithm I have come up with is as follows:

# Use pandas query to get all rows for which the columns exactly match the dictionary 
def pd_query(df, queryDict):
    # Enable querying non-string datatypes like integers
    strwrap = lambda val: '"' + val + '"' if isinstance(val, str) else str(val)
    query = ' and '.join([colname + '==' + strwrap(val) for colname, val in queryDict.items()])
    return df.query(query)

def f(df, col):
    # 1. Explicitly exclude special column from the data frame and remove duplicates
    df2 = df.drop(col).drop_duplicates()

    # 2. Loop over rows of new dataframe - effectively looping over all columns except that column
    for idx2, row2 in df2.iterrows():
        do_some_preliminary_stuff(row2)

        # 3. Query this row to the original dictionary, thus getting rows which only differ in the special column 
        df3 = pd_query(df, dict(row2))
        
        # 4. Loop over special column
        for idx3, row3 in df3.iterrows():
            do_more_stuff(row3)

My questions are:

  • Is there a faster/less clumsy algorithm to do this? It feels wasteful making repeated queries to the larger dataframe
  • Is this the intended way to use pandas query function?
    • Can one query a row directly without having to convert it into a fancy string?
    • Currently this comfortably handles strings, bools and ints, but can't handle tuples. Can this be improved to handle tuples?

Edit: Based on suggestion by @QuangHoang, the following seems to work well

import pandas as pd

df = pd.DataFrame({
    'Animal' : ['Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Mouse', 'Mouse'],
    'isSmart' : [True, False, True, True, False, False, True],
    'Color'  : ['W', 'B', 'W', 'B', 'W', 'B', 'W']
})

def f(df, col):
    colsEff = list(set(df.columns) - set([col]))
    
    for col_vals, data in df.groupby(colsEff):
        for idx3, row3 in data.iterrows():
            print(list(row3.values))
            

If we call f(df, 'Animal') we get the following, meaning that the loop over animals is the inner loop as expected.

['Cat', False, 'B']
['Mouse', False, 'B']
['Dog', False, 'W']
['Dog', True, 'B']
['Cat', True, 'W']
['Dog', True, 'W']

If instead we try f(df, 'isSmart') the inner loop is over smartness

['Cat', False, 'B']
['Cat', True, 'W']
['Dog', True, 'B']
['Dog', True, 'W']
['Dog', False, 'W']
['Mouse', False, 'B']
['Mouse', True, 'W']

So this part of the question is resolved. I still welcome some suggestions on how to stabilize the pd_query function to work with tuples, as I might need it later

Aleksejs Fomins
  • 688
  • 1
  • 8
  • 20
  • 2
    Please include sample input data and expected output data (no images). Please see https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples on how you can do this, and I'm sure someone could answer your question more easily and effectively. Thank you! – David Erickson Jan 07 '21 at 22:22
  • 1
    @DavidErickson Thanks, I'll do as you suggest – Aleksejs Fomins Jan 07 '21 at 22:23

1 Answers1

1

It looks like you are trying to do a groupby:

for col_vals, data in df.groupby(col):
    for idx3, row3 in data.iterrows()
        do_stuff(row3.drop(col))

That said, depending on do_stuff function, it might be just one single

 df.groupby(col).apply(do_stuff)
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • thanks a lot for the reply. It makes a lot of sense. I have tried it. I think that what I want is to group by everything else, except col. Would it work if I simply specify all other columns except of the desired column? – Aleksejs Fomins Jan 07 '21 at 22:31