How do I get functions to run for multiple dataframes without having to define countless variables within the function?

Question

I have a mock DataFrame below that has some sales items for different supermarkets and their cost:

Step 1

import pandas as pd

list_of_customers =[
['patrick','lemon','fruit','citrus',10,'tesco'],
['paul','lemon','fruit','citrus',20,'tesco'],
['frank','lemon','fruit','citrus',10,'tesco'],
['jim','lemon','fruit','citrus',20,'tesco'], 
['wendy','watermelon','fruit','',39,'tesco'],
['greg','watermelon','fruit','',32,'sainsburys'],
['wilson','carrot','vegetable','',34,'sainsburys'],    
['maree','carrot','vegetable','',22,'aldi'],
['greg','','','','','aldi'], 
['wilmer','sprite','drink','',22,'aldi'],
['jed','lime','fruit','citrus',40,'tesco'],    
['michael','lime','fruit','citrus',12,'aldi'],
['andrew','','','','33','aldi'], 
['ahmed','lime','fruit','fruit',33,'aldi'] 
]

df = pd.DataFrame(list_of_customers,columns = ['customer','item','item_type','fruit_type','cost','store'])

(df)

I create separate dataframes based on the store (i.e tesco, sainsburies, aldi etc)

Step 2

#create a separate dataframe for each store 

tesco_df = df.where(df['store'] == 'tesco') .dropna()

sainsburys_df = df.where(df['store'] == 'sainsburys').dropna()

aldi_df = df.where(df['store'] == 'aldi') .dropna()

list_of_stores = [tesco_df,sainsburys_df,aldi_df]

I then define a set of rules/conditions for each dataframe that are used to define categories where I want to get the total cost for each. Note this is just a small example I have created, the actual dataset I am working on would contain many more variables.

Step 3

#define variable for each value we need to aggregate (real dataset would have many mutually exclusive rules) 

fruit_variable = df['item_type'].isin(['fruit'])

vegetable_variable = df['item_type'].isin(['vegetable'])

citrus_variable = df['fruit_type'].isin(['citrus'])

I then go on to create a function that calculates the sum of the cost for each of the variables I defined. In the function I union each dataframe to create a final data frame with all of the sums.

Step 4

def compute_sum_variable(condition,field):
    df_var_name =  df.where(condition).dropna().groupby([field])[['cost']].sum().reset_index()
    df_var_name.columns = ['item','cost']
    return df_var_name

fruit_df = compute_sum_variable(fruit_variable,'item_type')
vegetable_df = compute_sum_variable(vegetable_variable,'item_type')
citrus_df = compute_sum_variable(citrus_variable,'fruit_type')
cost_df = pd.concat([fruit_df,vegetable_df,citrus_df],ignore_index = True)

print(cost_df)

#we need to create cost_df_tesco,cost_df_sainsburies,cost_df_aldi

My question here is how would I apply this to each of the dataframes I created [tesco_df,sainsburys_df,aldi_df], rather then just the original dataframe 'df', so that I get a new cost dataframe for each store , i.e. cost_df_tesco,cost_df_sainsburies,cost_df_aldi.

The process also needs to run step 3 for all of the separate dataframes. Should I convert step 3 to a function that runs for each entity? I thought of this but then it would result in far to many variables which would then need to go into the function in step 4.

My question is then as follows: How can I apply step 3 and step 4 so that they run for each dataframe and create a new total cost dataframe for each.

Please let me know if this doesn't make sense. I tried to apply the pipe + list comprehension in the following topic but had no luck applying this rationale.

How to apply function to multiple pandas dataframe

score 0 · Answer 1 · answered Dec 28 '22 at 17:47

You could use a dictionary to store the DF for each store e.g.

produce_types = [
    {
        'category': 'item_type',
        'value': 'fruit'
    },
    {
        'category': 'item_type',
        'value': 'vegetable'
    },
    {
        'category': 'fruit_type',
        'value': 'citrus'
    },
]

data = [
    ['patrick','lemon','fruit','citrus',10,'tesco'],
    ['paul','lemon','fruit','citrus',20,'tesco'],
    ['frank','lemon','fruit','citrus',10,'tesco'],
    ['jim','lemon','fruit','citrus',20,'tesco'], 
    ['wendy','watermelon','fruit','',39,'tesco'],
    ['greg','watermelon','fruit','',32,'sainsburys'],
    ['wilson','carrot','vegetable','',34,'sainsburys'],    
    ['maree','carrot','vegetable','',22,'aldi'],
    ['greg','','','','','aldi'], 
    ['wilmer','sprite','drink','',22,'aldi'],
    ['jed','lime','fruit','citrus',40,'tesco'],    
    ['michael','lime','fruit','citrus',12,'aldi'],
    ['andrew','','','','33','aldi'], 
    ['ahmed','lime','fruit','fruit',33,'aldi'] 
]

columns = ['customer','item','item_type','fruit_type','cost','store']

def get_cost_df(df, category, value):
    cost_df = df[df[category] == value].dropna().groupby(category).cost.sum().reset_index()
    cost_df.columns = ['item', 'cost']
    return cost_df

def aggregate_cost_dfs(df):
    return pd.concat(
        [get_cost_df(df, **produce_type) for produce_type in produce_types],
        ignore_index=True
    )

df = pd.DataFrame(data, columns=columns)
dfs = {store: aggregate_cost_dfs(df[df.store == store].copy()) for store in df.store.unique()}

ok I might break this one down 'produce_types'. – Patty Dec 29 '22 at 08:44 — Patty, Dec 29 '22 at 08:44

How do I get functions to run for multiple dataframes without having to define countless variables within the function?

1 Answers1