How to find all missing date for a given month for all employees

Question

I am using python pandas to calculate efficiency of the employees. I have a data frame describing employees of some company. Each employee have unique employee id. The data frame shows monthly record of the number of hours for all employees. So there might be some days missing from DF for each employee. So those dates range have to filled as zero rows with column dates as missing date and Id as employee id. Example -

Employee WH       Date    C3 C4 C5 
11        6  2021-06-03   -   -  -
11        7  2021-06-06
11        8  2021-06-08
13        5  2021-06-01
13        7  2021-06-02
13        7  2021-06-28

The missing date for employee id 11 is 01,02,04,05,07,09---30. The missing date for employee id 13 is 03,--27, 29,30. Like so there can be multiple employees with missing date range. The DF needs to be filled with all those missing values having Id and date and the rest of the columns as 0. and to be reindexed.

So you want 30 days for each employee but their WH are 0 for those days? — Akshay Sehgal, Aug 13 '21 at 19:29
Yes. Let say an employee worked 20 days out of a month so i want all 10 days he have't worked with WH as 0. — Sam, Aug 13 '21 at 19:30
This is by no means the optimal solution but what you could try is 1.) Groupby employees by Employee ID 2.) Use apply to Series where the function calculates the missing days. If you don't care about efficiency, you could make a set out of the work days, make a second set of days 1-30, then take the difference. — tcglezen, Aug 13 '21 at 19:41
So how i can read group by DF columns in a set? Like all date value for a given employee inn set S. — Sam, Aug 13 '21 at 19:45
Check my solution. i am adding more comments for better understanding. — Akshay Sehgal, Aug 13 '21 at 19:49
I have linked another post in the first line of my answer. Do visit that one as well and check it out. That was the inspiration for my answer. — Akshay Sehgal, Aug 13 '21 at 19:57
@Sam, can you check the solution with dynamic date range, please? — Corralien, Aug 13 '21 at 20:16

Jim · Answer 1 · 2021-08-13T20:13:18.997

This can be accomplished by reindexing

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

You'll need to first construct the new index you want (all employees, daily frequency), then set the identifying columns in the original dataframe as the index, and finally reindex and specify the fill value as 0.

id_cols = ['Date','Employee']

new_index = pd.MultiIndex.from_product(
        [pd.date_range(start='2021-06-01', end='2021-06-30', freq='D'),
         list_of_unique_employee_IDs],
    names = id_cols
    )

df2 = df.set_index(id_cols).reindex(new_index, fill_value = 0)

If you don't already have a list of all of the unique employee IDs, you can instead just get it from your original df with df.Employee.unique()

If you want to go back to the default integer index rather than keeping the MultiIndex of Employee and Date, you can add a .reset_index() to the end of the last line and it will insert Employee and Date back as columns in the dataframe

This is a great approach as well! Starting with completely new index for employees as well. — Akshay Sehgal, Aug 13 '21 at 20:14
One of my other thoughts that I didn't really specify with this approach is that it also allows filling in 0s for every day for an employee you know exists, but don't have any data for in the original dataframe — Jim, Aug 13 '21 at 20:23

score 0 · Answer 2 · answered Aug 13 '21 at 20:15

Instead using static date_range, you can generate it automatically:

def missing_dates_from_series(sr):
    return pd.date_range(df['Date'].min().strftime('%Y-%m'),
                         (df['Date'].max() + pd.DateOffset(months=1)).strftime('%Y-%m'),
                         closed='left', freq='D')


mi = pd.MultiIndex.from_frame(df.groupby('Employee')['Date']
                                .apply(missing_dates_from_series)
                                .explode().reset_index())

out = df.set_index(['Employee', 'Date']).reindex(mi, fill_value=0).reset_index()

>>> out
    Employee       Date  WH
0         11 2021-06-01   0
1         11 2021-06-02   0
2         11 2021-06-03   6
3         11 2021-06-04   0
4         11 2021-06-05   0
5         11 2021-06-06   7
6         11 2021-06-07   0
7         11 2021-06-08   8
8         11 2021-06-09   0
9         11 2021-06-10   0
10        11 2021-06-11   0
11        11 2021-06-12   0
12        11 2021-06-13   0
13        11 2021-06-14   0
14        11 2021-06-15   0
15        11 2021-06-16   0
16        11 2021-06-17   0
17        11 2021-06-18   0
18        11 2021-06-19   0
19        11 2021-06-20   0
20        11 2021-06-21   0
21        11 2021-06-22   0
22        11 2021-06-23   0
23        11 2021-06-24   0
24        11 2021-06-25   0
25        11 2021-06-26   0
26        11 2021-06-27   0
27        11 2021-06-28   0
28        11 2021-06-29   0
29        11 2021-06-30   0
30        13 2021-06-01   5
31        13 2021-06-02   7
32        13 2021-06-03   0
33        13 2021-06-04   0
34        13 2021-06-05   0
35        13 2021-06-06   0
36        13 2021-06-07   0
37        13 2021-06-08   0
38        13 2021-06-09   0
39        13 2021-06-10   0
40        13 2021-06-11   0
41        13 2021-06-12   0
42        13 2021-06-13   0
43        13 2021-06-14   0
44        13 2021-06-15   0
45        13 2021-06-16   0
46        13 2021-06-17   0
47        13 2021-06-18   0
48        13 2021-06-19   0
49        13 2021-06-20   0
50        13 2021-06-21   0
51        13 2021-06-22   0
52        13 2021-06-23   0
53        13 2021-06-24   0
54        13 2021-06-25   0
55        13 2021-06-26   0
56        13 2021-06-27   0
57        13 2021-06-28   7
58        13 2021-06-29   0
59        13 2021-06-30   0

Nice function for auto generating dates. – Akshay Sehgal Aug 13 '21 at 20:17 — Akshay Sehgal, Aug 13 '21 at 20:17

Akshay Sehgal · Accepted Answer · 2021-08-13T20:41:29.167

Try the pd.DataFrame.reindex method. The inspiration for this solution is taken from this excellent post. Since it is a dataframe instead of series, a few extra steps will be needed to get to the expected output, as shown below.

idx = pd.date_range('2021-06-01', '2021-06-30') #Set your date range

df.set_index('Date', inplace=True)
df.index = pd.DatetimeIndex(df.index)

output = df.groupby('Employee').apply(pd.DataFrame.reindex, idx, fill_value=0)\
                               .drop('Employee',1)\
                               .reset_index()\
                               .rename(columns={'level_1':'Date'})

print(output)

    Employee       Date  WH
0         11 2021-06-01   0
1         11 2021-06-02   0
2         11 2021-06-03   5
3         11 2021-06-04   7
4         11 2021-06-05   0
5         11 2021-06-06   0
6         11 2021-06-07   0
7         11 2021-06-08   0
8         11 2021-06-09   0
9         11 2021-06-10   0
10        11 2021-06-11   0
11        11 2021-06-12   0
12        11 2021-06-13   0
13        11 2021-06-14   0
14        11 2021-06-15   0
15        11 2021-06-16   0
16        11 2021-06-17   0
17        11 2021-06-18   0
18        11 2021-06-19   0
19        11 2021-06-20   0
20        11 2021-06-21   0
21        11 2021-06-22   0
22        11 2021-06-23   0
23        11 2021-06-24   0
24        11 2021-06-25   0
25        11 2021-06-26   0
26        11 2021-06-27   0
27        11 2021-06-28   0
28        11 2021-06-29   0
29        11 2021-06-30   0
30        13 2021-06-01   0
31        13 2021-06-02   0
32        13 2021-06-03   0
33        13 2021-06-04   8
34        13 2021-06-05   5
35        13 2021-06-06   0
36        13 2021-06-07   0
37        13 2021-06-08   0
38        13 2021-06-09   0
39        13 2021-06-10   0
40        13 2021-06-11   0
41        13 2021-06-12   0
42        13 2021-06-13   0
43        13 2021-06-14   0
44        13 2021-06-15   0
45        13 2021-06-16   0
46        13 2021-06-17   0
47        13 2021-06-18   0
48        13 2021-06-19   0
49        13 2021-06-20   0
50        13 2021-06-21   0
51        13 2021-06-22   0
52        13 2021-06-23   0
53        13 2021-06-24   0
54        13 2021-06-25   0
55        13 2021-06-26   0
56        13 2021-06-27   0
57        13 2021-06-28   0
58        13 2021-06-29   0
59        13 2021-06-30   0

One typo error is there . output = s.groupby('id').apply(f).reset_index().rename(columns={'level_1':'date'}) x.groupby('id') — Sam, Aug 13 '21 at 20:09
I am storing the original dataframe with missing rows in s. While the lambda function works on a parameter called x. .. so Lambda function would have x and you have to then apply it on the `s.groupby` — Akshay Sehgal, Aug 13 '21 at 20:13

How to find all missing date for a given month for all employees

3 Answers3