0

this is my first question, so bear with me if i am doing something wrong :) I am working with a data set for 2018, 2019, 2020 and 2021. (The data set is complete with a file for each day, i have checked the files manually) I am trying to sum some values with pandas by Groupby and datetime. The end goal is to have a sum value for each day for the whole data set. For some reason, the year 2020 is skipping the 4th in every month, and i cant figure out why. i have seartch the stackoverflow, datacamp and others for many hours now without a solution. Now i have to try ask you guys.

My code starts by making a list with all the filenames i my folder. After that i replace "." with ":" so i can convert the dates to datetime64[ns] (there are different date formats in the files) the code append the dates to a list and set them in the column. returns the files by concat in a single dataframe. after this i am making a new column where i look for some words in another column. this new column is used for sorting the dataframe. i use .loc for this. then i groupby.dt.date and .sum then if a date has a value of zero i need it to show so i use .asfreq("D") and .fillna(0)

its all good, but every 4th (4/1/2020, 4/2/2020, 4/3/2020, 4/4/2020, 4/5/2020 and so on) is missing. But WHYYY! :) (The file are there. if i take all the files for them self, its all good. The values are there.)

def FileList(): #Creates a list with all the files
    CSVFileList = []
    FilePath = 'Data/'
    FileDirectoryList = os.listdir(FilePath)
    for i in FileDirectoryList:
        if(i[-3:] == "CSV"):
            CSVFileList.append(i)
    return CSVFileList
def Concat(filelist): #Concat all the files in the list from FileList
    columns_names = ["Order_ID", "Dato", "Varenummer", "Varenavn", "Quantity", "Pris", "Pris_Sum", "Tilbehør", "col1", "col2"]
    dflist = []
    for i in filelist:
        df = pd.read_csv("Data/"+i, delimiter = ";", names = columns_names)
        DatoList = []
        for i in df["Dato"]:
            u = i.replace(".",":")
            DatoList.append(u)
        df["Dato"] = DatoList
        df['Dato'] = pd.to_datetime(df['Dato']) #Convert date's to Datetime64[ns]
        dflist.append(df)
    dataframe = pd.concat(dflist)
    return dataframe
def BestillingDF(filename):    
    
    df = filename
    #New Vare_Beskrivelse
    Vare_navn = []
    Varenavn = df["Varenavn"]
    for i in Varenavn:
        i = str(i)
        i = i.lower()
        if "spareribs" in i:
            Vare_navn.append("Spareribs")
        elif "salat" in i:
            Vare_navn.append("Salat")
        elif "pommes" in i:
            Vare_navn.append("Pommes Frites")
        else:
            Vare_navn.append("")
            
    df["Vare_Navn"] = Vare_navn
    Spareribs_Vægt_List = []
    spareribsvægt = bi.Spareribs_Vægt() #Gets a value from a dict in another file
    for i in df["Varenavn"]:
        if i not in spareribsvægt:
            Spareribs_Vægt_List.append(0)
        else:
            Spareribs_Vægt_List.append(spareribsvægt[i])
    df["Spareribs_Vægt_Kuvert"] = Spareribs_Vægt_List
    df["Spareribs_Vægt_[Q*K]"] = df["Quantity"] * df["Spareribs_Vægt_Kuvert"]
    return df
def Spareribs_A_Day(filename):
    data = BestillingDF(filename)
    spare = data.loc[data["Vare_Navn"] == "Spareribs"] #Sorts the dataframe 

    s = spare.groupby([spare['Dato'].dt.date])['Spareribs_Vægt_Kuvert'].sum() #Groups the series by date/day and sums the value

    t = s.asfreq('D') #Shows the zero values (NaN)

    df = pd.DataFrame(t)
    df['Spareribs_Vægt_Kuvert'] = df['Spareribs_Vægt_Kuvert'].fillna(0) #Fills the NAN by zero
    return df
def main():
        
    filelist = FileList()
    concat_DF = Concat(filelist)
    ribs = Spareribs_A_Day(concat_DF)
    print(ribs.head())

main()

I am no champion in python or coding. but i hope that you guys maybe can help me anyhow :) the code result

  • 1
    Include sample data as text and expected output: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – David Erickson Jun 28 '21 at 00:39
  • 1
    We'd need to see examples of the values of the date column which are not appearing in the output, Otherwise we are just guessing as to what is going wrong. – John D Jun 28 '21 at 01:06

1 Answers1

0

Thanks for the quick respons from David Erickson & John D. i will remember to post Data Sample next time i post a question. (the problem with this data sample is that it is very big. And the problem will not show up when the data sample is "perfect")

but, but, but, I found the answer... in a way.... :)

my .csv files have a Date inside the file. the one i was using. the date format changed over time, and i tried to tackle that. i changed the dates to the same format, and it looked like it was all good. but evry time the .day turned shifted to the same number as .month (4/4/2020 or 7/7/2020, 9/9/2020, and so on) the format changed (%Y-%m-%d & %Y-%d-%m) then when i used

spare.groupby([spare['CleanD'].dt.date])['SpareVægtSum'].sum()

The sum was all wired. its not the hole problem, but it was a pattern i found. my solution was to get the Date from the filename and the -1 day like:

def CleanD():
    file = FileList()
    df_list = []
    for i in file:
        columns_names = ["Order_ID", "Dato", "Varenummer", "Varenavn", "Quantity", "Pris", "Pris_Sum", "Tilbehør", "col1", "col2"]
        df = pd.read_csv("Data/"+i, delimiter = ";", names = columns_names)
        cleanD_list = []
        for u in df["Dato"]:
            cleanD_list.append(i[0:10])
        df["CleanD"] = cleanD_list
        df_list.append(df)
    df = pd.concat(df_list)
    
    df['CleanD']= pd.to_datetime(df['CleanD'], format='%Y-%m-%d')
    
    cleanD_list = []
    for i in df["CleanD"]:
        u = i + pd.Timedelta(days=-1)
        cleanD_list.append(u)
    df['CleanD'] =  cleanD_list
    return df

And now its all good :) don't know why .datetime is acting like that, maybe someone knows what/why? but its fixed, and i have used so mutch time on it, that im happy with the solution i have found this time :=)