4

My data is absence records from a factory. Some days there are no absences so there is no data or date recorded for that day. However, and where this gets hairy with the other examples shown, is on any given day there can be several absences for various reasons. There is not always a 1 to 1 ratio of date-to-record in the data.

The result I'm hoping for is something like this:

(index)    Shift        Description     Instances (SUM)
01-01-14   2nd Baker    Discipline      0
01-01-14   2nd Baker    Vacation        0
01-01-14   1st Cooks    Discipline      0
01-01-14   1st Cooks    Vacation        0
01-02-14   2nd Baker    Discipline      4
01-02-14   2nd Baker    Vacation        3
01-02-14   1st Cooks    Discipline      3
01-02-14   1st Cooks    Vacation        3

And so on. The idea is all shifts and descriptions will have values for all days in the time period (in this example 1/1/2014 - 12/31/2014)

I've read several examples and the closest I've come to getting this working is here.

ts = pd.read_csv('Absentee_Data_2.csv'
                , encoding = 'utf-8'
                ,parse_dates=[3]
                ,index_col=3
                ,dayfirst=True
                )

idx =  pd.date_range('01.01.2009', '12.31.2017')

ts.index = pd.DatetimeIndex(ts.index)
# ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')

But, when I uncomment the ts = ts.reindex(idx, fill_value='NaN') I get error messages. I've tried at least 10 other ways to accomplish what I'm trying to do so I'm not 100% sure this is the right path, but it seems to have gotten me closest to any kind of progress.

Here's some sample data:

Description Unexcused   Instances   Date        Shift
Discipline  FALSE              1    Jan 2 2014  2nd Baker
Vacation    TRUE               2    Jan 2 2014  1st Cooks
Discipline  FALSE              3    Jan 2 2014  2nd Baker
Vacation    TRUE               1    Jan 2 2014  1st Cooks
Discipline  FALSE              2    Apr 8 2014  2nd Baker
Vacation    TRUE               3    Apr 8 2014  1st Cooks
Discipline  FALSE              1    Jun 1 2014  2nd Baker
Vacation    TRUE               2    Jun 1 2014  1st Cooks
Discipline  FALSE              3    Jun 1 2014  2nd Baker
Vacation    TRUE               1    Jun 1 2014  1st Cooks
Vacation    TRUE               2    Jul 5 2014  1st Cooks
Discipline  FALSE              3    Jul 5 2014  2nd Baker
Vacation    TRUE               2    Dec 3 2014  1st Cooks

Thank you in advance for you help, I'm a newbie and 2 days into this without much progress. I really appreciate how people here help with answers but most importantly instruction on why the solutions work. Newbies like me are very grateful for the wisdom shared.

Programming_Learner_DK
  • 1,509
  • 4
  • 23
  • 49

2 Answers2

5

I think you just have a problem with the use of datetime, this approach worked for me

ts.set_index(['Date'],inplace=True)
ts.index = pd.to_datetime(ts.index,format='%b %d %Y')
d2 = pd.DataFrame(index=pd.date_range('2014-01-01','2014-12-31'))

print ts.join(d2,how='right')
DJK
  • 8,924
  • 4
  • 24
  • 40
  • 2
    Both answers worked but this one was easier for me to understand and loop through with my real data. I did have to do some further manipulation and thinking but this was ultimately the answer I used. – Programming_Learner_DK Aug 08 '17 at 10:44
  • 1
    Worked for me too! I had a "smeidum" dataframe at 900,000 rows and wanted to add missing dates before a pivot. Thanks! – Linwoodc3 May 26 '19 at 13:15
1

Actually you were pretty close of what you wanted (assuming I understood correctly the output you seem to be looking for). See my additions to your code above:

import pandas as pd

ts = pd.read_csv('Absentee_Data_2.csv', encoding = 'utf-8',parse_dates=[3],index_col=3,dayfirst=True, sep=",")

idx =  pd.date_range('01.01.2009', '12.31.2017')

ts.index = pd.DatetimeIndex(ts.index)
#ts = ts.reindex(idx, fill_value='NaN')
df = pd.DataFrame(index = idx)
df1 = df.join(ts, how='left')
df2 = df1.copy()
df3 = df1.copy()
df4 = df1.copy()
dict1 = {'Description': 'Discipline', 'Instances': 0, 'Shift': '1st Cooks'}
df1 = df1.fillna(dict1)
dict1["Description"] = "Vacation"
df2 = df2.fillna(dict1)
dict1["Shift"] = "2nd Baker"
df3 = df3.fillna(dict1)
dict1["Description"] = "Discipline"
df4 = df4.fillna(dict1)
df_with_duplicates = pd.concat([df1,df2,df3,df4])
final_res = df_with_duplicates.reset_index().drop_duplicates(subset=["index"] + list(dict1.keys())).set_index("index").drop("Unexcused", axis=1)

Basically what you'd add:

  • Copy 4 times the almost empty df created with ts (df1)
  • fillna(dict1) allows to fill with static values all the NaN in the columns
  • Concatenate the 4 dfs, we still need to remove some duplicates as the original values from the csv are duplicated 4 times
  • Drop the duplicates, we need the index to keep the values added, thus the reset_index followed by the `set_index("index")
  • Finally drop the Unexcused column

Finally a few output:

In [5]: final_res["2013-01-2"]
Out[5]: 
           Description  Instances      Shift
index                                       
2013-01-02  Discipline        0.0  1st Cooks
2013-01-02    Vacation        0.0  1st Cooks
2013-01-02    Vacation        0.0  2nd Baker
2013-01-02  Discipline        0.0  2nd Baker

In [6]: final_res["2014-01-2"]
Out[6]: 
           Description  Instances       Shift
index                                        
2014-01-02  Discipline        1.0   2nd Baker
2014-01-02    Vacation        2.0   1st Cooks
2014-01-02  Discipline        3.0   2nd Baker
2014-01-02    Vacation        1.0   1st Cooks
1
Adonis
  • 4,670
  • 3
  • 37
  • 57
  • trying this solution but I keep getting the following error: 'Type Error: can only concatenate list (not "dict_keys") to list' on the following line of code: "final_res = (df_with_duplicates.reset_index() .drop_duplicates(subset=["index"] + dict1.keys()) .set_index("index").drop("Unexcused", axis=1))" Any suggestions? Thank you and thank you for the explanation too :) – Programming_Learner_DK Aug 07 '17 at 12:38
  • @SDS My bad a small typo got in, you need to convert the keys of `dict1` to a list, so it should be `subse‌​t=["index"] + list(dict1.keys())`, I edited my post – Adonis Aug 07 '17 at 13:03
  • @SDS If you feel that an answer has been provided please mark it as accepted. It helps keep the focus on unanswered questions. If answers did not help, can you provide a feedback as to what is missing? – Adonis Aug 08 '17 at 10:17