
This code is meant to find the average promotion value for a given month within a two-year period. The data set has about 11,000 rows in total. The code has been running for 5 minutes and still hasn't produced any results. I'm still very much a novice in my coding career, so any tips on how to better optimize code for faster completion times would be appreciated!

import pandas as pd
df = pd.read_csv(r'C:\Users\james.rush\df_LG.csv')
df.head()
Promos = []
Avg_Promo = []
Dates = []

#This function is used to determine the Average Promotion during any given month/year
def Promo_Avg(Date):
    for x in df['Date']: #For all dates in dataframe
        Promo_Value = df.loc[df['Date'] == Date, 'Promo'] #Locate the corresponding promo given the provided date
        Promos.append(Promo_Value) #Add that Promo to the list of Promos for that month, will need list length later
    Average_Promotion = sum(Promos)/len(Promos) #Average Promotion during the given month
    if Average_Promotion not in Avg_Promo: #Prevents Duplicates
        Avg.append(Average_Promotion)
    if Date not in Dates: #If the Current Date being Checked is not in list, add to list. This will prevent Duplicates
        Dates.append(Dates) 

Function_Dates = [
    'January2020',
    'Febuary2020',
    'March2020'
                 ]
for x in Function_Dates:
    Promo_Avg(x)
Cosmickid
  • 1) Much faster to use mean function on DataFrame column e.g. [pandas get column average/mean](https://stackoverflow.com/questions/31037298/pandas-get-column-average-mean) 2) It would be better if the function just returned the average given a date rather than also updating Avg_Promo, Dates (i.e. functions should not do unrelated things). – DarrylG Jul 21 '22 at 16:41
  • It would be helpful to provide some lines from the data file you're processing. – DarrylG Jul 21 '22 at 16:48
  • @DarrylG Thank you for the suggestion, I'll try this function out – Cosmickid Jul 21 '22 at 16:48
  • Using Pandas mean, the function simplifies to: `def promo_avg(date): return df[df.Date==date].Promo.mean()` where I used the Python naming convention for variables and function names. – DarrylG Jul 21 '22 at 16:51

1 Answer


It seems that you loop over the dataframe but never use the loop variable x: the df.loc lookup inside the loop depends only on Date, so every iteration repeats the same filter and the for loop is redundant.
With about 11,000 rows, each call performs roughly 11,000 × 11,000 = 121,000,000 row comparisons, which is likely why it is slow.
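To illustrate the redundancy, here is a minimal sketch on a made-up three-row dataframe (the values are hypothetical; only the column names Date and Promo come from the question). It shows that every iteration of the original loop produces exactly the same filtered result as a single lookup:

```python
import pandas as pd

# Hypothetical toy data standing in for df_LG.csv
df = pd.DataFrame({
    "Date": ["January2020", "January2020", "March2020"],
    "Promo": [10.0, 20.0, 5.0],
})

date = "January2020"

# A single boolean-mask lookup already scans every row once.
single_pass = df.loc[df["Date"] == date, "Promo"]

# The pattern from the question: x is never used, so each
# iteration repeats the identical full scan of the dataframe.
repeated = [df.loc[df["Date"] == date, "Promo"] for x in df["Date"]]

# Every repeated result is identical to the single lookup.
all_same = all(s.equals(single_pass) for s in repeated)
print(all_same)
```

So the loop can simply be dropped and the lookup done once per date.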

Some more details about your question:
Credits to @DarrylG's comment.

You are trying to

find the average promotion value in a given month in a two-year period

This breaks down into three parts:
find, project, and average.

def promo_avg(date):
    # The column is named 'Promo' in the question's dataframe
    return df[df.Date==date].Promo.mean()

This seems to do the job. Let's look at the three parts in detail:
Find:
df[df.Date==date] selects the rows of df whose Date column equals date.

Project:
By project I mean projection in the relational-algebra sense: restricting your data to specific columns, in your case the column Promo. The Find step returns a dataframe, so you can project it simply by appending .Promo: df[df.Date==date].Promo

Average:
After the projection you have a single column (a pandas Series), which comes with built-in aggregation methods, including mean(). df[df.Date==date].Promo.mean() should do the trick.
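The three steps can be sketched separately on a small made-up dataframe (the values are hypothetical; the column names Date and Promo are taken from the question). The last lines also show groupby, a different technique from the function above, which computes the average for every date in one pass and may be convenient if you need all months at once:

```python
import pandas as pd

# Hypothetical toy data standing in for df_LG.csv
df = pd.DataFrame({
    "Date": ["January2020", "January2020", "March2020"],
    "Promo": [10.0, 20.0, 5.0],
})

def promo_avg(date):
    found = df[df["Date"] == date]   # Find: rows matching the date (DataFrame)
    projected = found["Promo"]       # Project: keep only the Promo column (Series)
    return projected.mean()          # Average: mean of that column

jan = promo_avg("January2020")
print(jan)

# Alternative: average Promo for every distinct Date in one pass.
by_date = df.groupby("Date")["Promo"].mean()
print(by_date)
```

Note that for a date with no matching rows, the mean of the empty selection is NaN rather than an error.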

I hope it was clear and useful :)

xonturis