I have a mock DataFrame below that has some sales items for different supermarkets and their cost:
Step 1
import pandas as pd
list_of_customers =[
['patrick','lemon','fruit','citrus',10,'tesco'],
['paul','lemon','fruit','citrus',20,'tesco'],
['frank','lemon','fruit','citrus',10,'tesco'],
['jim','lemon','fruit','citrus',20,'tesco'],
['wendy','watermelon','fruit','',39,'tesco'],
['greg','watermelon','fruit','',32,'sainsburys'],
['wilson','carrot','vegetable','',34,'sainsburys'],
['maree','carrot','vegetable','',22,'aldi'],
['greg','','','','','aldi'],
['wilmer','sprite','drink','',22,'aldi'],
['jed','lime','fruit','citrus',40,'tesco'],
['michael','lime','fruit','citrus',12,'aldi'],
['andrew','','','','33','aldi'],
['ahmed','lime','fruit','fruit',33,'aldi']
]
df = pd.DataFrame(list_of_customers,columns = ['customer','item','item_type','fruit_type','cost','store'])
(df)
I create separate dataframes based on the store (i.e tesco, sainsburies, aldi etc)
Step 2
#create a separate dataframe for each store
tesco_df = df.where(df['store'] == 'tesco') .dropna()
sainsburys_df = df.where(df['store'] == 'sainsburys').dropna()
aldi_df = df.where(df['store'] == 'aldi') .dropna()
list_of_stores = [tesco_df,sainsburys_df,aldi_df]
I then define a set of rules/conditions for each dataframe that are used to define categories where I want to get the total cost for each. Note this is just a small example I have created, the actual dataset I am working on would contain many more variables.
Step 3
#define variable for each value we need to aggregate (real dataset would have many mutually exclusive rules)
fruit_variable = df['item_type'].isin(['fruit'])
vegetable_variable = df['item_type'].isin(['vegetable'])
citrus_variable = df['fruit_type'].isin(['citrus'])
I then go on to create a function that calculates the sum of the cost for each of the variables I defined. In the function I union each dataframe to create a final data frame with all of the sums.
Step 4
def compute_sum_variable(condition,field):
df_var_name = df.where(condition).dropna().groupby([field])[['cost']].sum().reset_index()
df_var_name.columns = ['item','cost']
return df_var_name
fruit_df = compute_sum_variable(fruit_variable,'item_type')
vegetable_df = compute_sum_variable(vegetable_variable,'item_type')
citrus_df = compute_sum_variable(citrus_variable,'fruit_type')
cost_df = pd.concat([fruit_df,vegetable_df,citrus_df],ignore_index = True)
print(cost_df)
#we need to create cost_df_tesco,cost_df_sainsburies,cost_df_aldi
My question here is how would I apply this to each of the dataframes I created [tesco_df,sainsburys_df,aldi_df], rather then just the original dataframe 'df', so that I get a new cost dataframe for each store , i.e. cost_df_tesco,cost_df_sainsburies,cost_df_aldi.
The process also needs to run step 3 for all of the separate dataframes. Should I convert step 3 to a function that runs for each entity? I thought of this but then it would result in far to many variables which would then need to go into the function in step 4.
My question is then as follows: How can I apply step 3 and step 4 so that they run for each dataframe and create a new total cost dataframe for each.
Please let me know if this doesn't make sense. I tried to apply the pipe + list comprehension in the following topic but had no luck applying this rationale.