-2

I am using python pandas to calculate efficiency of the employees. I have a data frame describing employees of some company. Each employee have unique employee id. The data frame shows monthly record of the number of hours for all employees. So there might be some days missing from DF for each employee. So those dates range have to filled as zero rows with column dates as missing date and Id as employee id. Example -

Employee WH       Date    C3 C4 C5 
11        6  2021-06-03   -   -  -
11        7  2021-06-06
11        8  2021-06-08
13        5  2021-06-01
13        7  2021-06-02
13        7  2021-06-28

The missing date for employee id 11 is 01,02,04,05,07,09---30. The missing date for employee id 13 is 03,--27, 29,30. Like so there can be multiple employees with missing date range. The DF needs to be filled with all those missing values having Id and date and the rest of the columns as 0. and to be reindexed.

Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
Sam
  • 31
  • 3
  • So you want 30 days for each employee but their WH are 0 for those days? – Akshay Sehgal Aug 13 '21 at 19:29
  • Yes. Let say an employee worked 20 days out of a month so i want all 10 days he have't worked with WH as 0. – Sam Aug 13 '21 at 19:30
  • This is by no means the optimal solution but what you could try is 1.) Groupby employees by Employee ID 2.) Use apply to Series where the function calculates the missing days. If you don't care about efficiency, you could make a set out of the work days, make a second set of days 1-30, then take the difference. – tcglezen Aug 13 '21 at 19:41
  • So how i can read group by DF columns in a set? Like all date value for a given employee inn set S. – Sam Aug 13 '21 at 19:45
  • Check my solution. i am adding more comments for better understanding. – Akshay Sehgal Aug 13 '21 at 19:49
  • I didn't did it. I will run this code. – Sam Aug 13 '21 at 19:55
  • I have linked another post in the first line of my answer. Do visit that one as well and check it out. That was the inspiration for my answer. – Akshay Sehgal Aug 13 '21 at 19:57
  • @Sam, can you check the solution with dynamic date range, please? – Corralien Aug 13 '21 at 20:16

3 Answers3

0

This can be accomplished by reindexing

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html

You'll need to first construct the new index you want (all employees, daily frequency), then set the identifying columns in the original dataframe as the index, and finally reindex and specify the fill value as 0.

id_cols = ['Date','Employee']

new_index = pd.MultiIndex.from_product(
        [pd.date_range(start='2021-06-01', end='2021-06-30', freq='D'),
         list_of_unique_employee_IDs],
    names = id_cols
    )

df2 = df.set_index(id_cols).reindex(new_index, fill_value = 0)

If you don't already have a list of all of the unique employee IDs, you can instead just get it from your original df with df.Employee.unique()

If you want to go back to the default integer index rather than keeping the MultiIndex of Employee and Date, you can add a .reset_index() to the end of the last line and it will insert Employee and Date back as columns in the dataframe

Jim
  • 398
  • 1
  • 6
  • This is a great approach as well! Starting with completely new index for employees as well. – Akshay Sehgal Aug 13 '21 at 20:14
  • One of my other thoughts that I didn't really specify with this approach is that it also allows filling in 0s for every day for an employee you know exists, but don't have any data for in the original dataframe – Jim Aug 13 '21 at 20:23
0

Instead using static date_range, you can generate it automatically:

def missing_dates_from_series(sr):
    return pd.date_range(df['Date'].min().strftime('%Y-%m'),
                         (df['Date'].max() + pd.DateOffset(months=1)).strftime('%Y-%m'),
                         closed='left', freq='D')


mi = pd.MultiIndex.from_frame(df.groupby('Employee')['Date']
                                .apply(missing_dates_from_series)
                                .explode().reset_index())

out = df.set_index(['Employee', 'Date']).reindex(mi, fill_value=0).reset_index()
>>> out
    Employee       Date  WH
0         11 2021-06-01   0
1         11 2021-06-02   0
2         11 2021-06-03   6
3         11 2021-06-04   0
4         11 2021-06-05   0
5         11 2021-06-06   7
6         11 2021-06-07   0
7         11 2021-06-08   8
8         11 2021-06-09   0
9         11 2021-06-10   0
10        11 2021-06-11   0
11        11 2021-06-12   0
12        11 2021-06-13   0
13        11 2021-06-14   0
14        11 2021-06-15   0
15        11 2021-06-16   0
16        11 2021-06-17   0
17        11 2021-06-18   0
18        11 2021-06-19   0
19        11 2021-06-20   0
20        11 2021-06-21   0
21        11 2021-06-22   0
22        11 2021-06-23   0
23        11 2021-06-24   0
24        11 2021-06-25   0
25        11 2021-06-26   0
26        11 2021-06-27   0
27        11 2021-06-28   0
28        11 2021-06-29   0
29        11 2021-06-30   0
30        13 2021-06-01   5
31        13 2021-06-02   7
32        13 2021-06-03   0
33        13 2021-06-04   0
34        13 2021-06-05   0
35        13 2021-06-06   0
36        13 2021-06-07   0
37        13 2021-06-08   0
38        13 2021-06-09   0
39        13 2021-06-10   0
40        13 2021-06-11   0
41        13 2021-06-12   0
42        13 2021-06-13   0
43        13 2021-06-14   0
44        13 2021-06-15   0
45        13 2021-06-16   0
46        13 2021-06-17   0
47        13 2021-06-18   0
48        13 2021-06-19   0
49        13 2021-06-20   0
50        13 2021-06-21   0
51        13 2021-06-22   0
52        13 2021-06-23   0
53        13 2021-06-24   0
54        13 2021-06-25   0
55        13 2021-06-26   0
56        13 2021-06-27   0
57        13 2021-06-28   7
58        13 2021-06-29   0
59        13 2021-06-30   0
Corralien
  • 109,409
  • 8
  • 28
  • 52
-1

Try the pd.DataFrame.reindex method. The inspiration for this solution is taken from this excellent post. Since it is a dataframe instead of series, a few extra steps will be needed to get to the expected output, as shown below.

idx = pd.date_range('2021-06-01', '2021-06-30') #Set your date range

df.set_index('Date', inplace=True)
df.index = pd.DatetimeIndex(df.index)

output = df.groupby('Employee').apply(pd.DataFrame.reindex, idx, fill_value=0)\
                               .drop('Employee',1)\
                               .reset_index()\
                               .rename(columns={'level_1':'Date'})

print(output)
    Employee       Date  WH
0         11 2021-06-01   0
1         11 2021-06-02   0
2         11 2021-06-03   5
3         11 2021-06-04   7
4         11 2021-06-05   0
5         11 2021-06-06   0
6         11 2021-06-07   0
7         11 2021-06-08   0
8         11 2021-06-09   0
9         11 2021-06-10   0
10        11 2021-06-11   0
11        11 2021-06-12   0
12        11 2021-06-13   0
13        11 2021-06-14   0
14        11 2021-06-15   0
15        11 2021-06-16   0
16        11 2021-06-17   0
17        11 2021-06-18   0
18        11 2021-06-19   0
19        11 2021-06-20   0
20        11 2021-06-21   0
21        11 2021-06-22   0
22        11 2021-06-23   0
23        11 2021-06-24   0
24        11 2021-06-25   0
25        11 2021-06-26   0
26        11 2021-06-27   0
27        11 2021-06-28   0
28        11 2021-06-29   0
29        11 2021-06-30   0
30        13 2021-06-01   0
31        13 2021-06-02   0
32        13 2021-06-03   0
33        13 2021-06-04   8
34        13 2021-06-05   5
35        13 2021-06-06   0
36        13 2021-06-07   0
37        13 2021-06-08   0
38        13 2021-06-09   0
39        13 2021-06-10   0
40        13 2021-06-11   0
41        13 2021-06-12   0
42        13 2021-06-13   0
43        13 2021-06-14   0
44        13 2021-06-15   0
45        13 2021-06-16   0
46        13 2021-06-17   0
47        13 2021-06-18   0
48        13 2021-06-19   0
49        13 2021-06-20   0
50        13 2021-06-21   0
51        13 2021-06-22   0
52        13 2021-06-23   0
53        13 2021-06-24   0
54        13 2021-06-25   0
55        13 2021-06-26   0
56        13 2021-06-27   0
57        13 2021-06-28   0
58        13 2021-06-29   0
59        13 2021-06-30   0
Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51