2

I have a dataframe with multiple string columns, one date column and one int value column.

I want to ffill the missing dates for each group of text columns. The missing dates are all dates from the min date to max date in the dataframe. I think this is better explained using an example.

Sample Input:

group        rtype     location   hardware     date          value
my-group     type-s     NY         DTop        2020-08-05     10
my-group     type-s     NY         DTop        2020-08-07     20
my-group     type-s     NY         DTop        2020-08-10     30
my-group     type-s     NY         Tower       2020-08-01     40
my-group     type-s     NY         Tower       2020-08-07     50
ot-group     type-t     NY         LTop        2020-08-08     90

Min and Max date for this dataframe: (start_date) 2020-08-01 - (end_date) 2020-08-10

Sample Output:

group       rtype       location   hardware    date          value
my-group    type-s        NY          DTop     2020-08-01    0
my-group    type-s        NY          DTop     2020-08-02    0
my-group    type-s        NY          DTop     2020-08-03    0
my-group    type-s        NY          DTop     2020-08-04    0
my-group    type-s        NY          DTop     2020-08-05    10
my-group    type-s        NY          DTop     2020-08-06    10
my-group    type-s        NY          DTop     2020-08-07    20
my-group    type-s        NY          DTop     2020-08-08    20
my-group    type-s        NY          DTop     2020-08-09    20
my-group    type-s        NY          DTop     2020-08-10    30
my-group    type-s        NY          Tower    2020-08-01    40
my-group    type-s        NY          Tower    2020-08-02    40
my-group    type-s        NY          Tower    2020-08-03    40
my-group    type-s        NY          Tower    2020-08-04    40
my-group    type-s        NY          Tower    2020-08-05    40
my-group    type-s        NY          Tower    2020-08-06    40
my-group    type-s        NY          Tower    2020-08-07    50
my-group    type-s        NY          Tower    2020-08-08    50
my-group    type-s        NY          Tower    2020-08-09    50
my-group    type-s        NY          Tower    2020-08-10    50
ot-group    type-t        NY          LTop     2020-08-01    0
ot-group    type-t        NY          LTop     2020-08-02    0
ot-group    type-t        NY          LTop     2020-08-03    0
ot-group    type-t        NY          LTop     2020-08-04    0
ot-group    type-t        NY          LTop     2020-08-05    0
ot-group    type-t        NY          LTop     2020-08-06    0
ot-group    type-t        NY          LTop     2020-08-07    0
ot-group    type-t        NY          LTop     2020-08-08    90
ot-group    type-t        NY          LTop     2020-08-09    90
ot-group    type-t        NY          LTop     2020-08-10    90

In this example, I kept the location fixed to avoid an extra long output. I am able to get the dates I want using pd.date_range().

I tried using resample with multiindex but I run into errors (similar to this).

I tried the approach mentioned in this answer but it doesn't seem to work:

My code using:

import pandas as pd

df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):    
    return df.reindex(date_range).ffill()

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())

Output of this code:

10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
               '2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
               '2020-08-09', '2020-08-10'],
              dtype='datetime64[ns]', freq='D')
            group  rtype  location  hardware  value
2020-08-01    NaN    NaN       NaN       NaN    NaN
2020-08-02    NaN    NaN       NaN       NaN    NaN
2020-08-03    NaN    NaN       NaN       NaN    NaN
2020-08-04    NaN    NaN       NaN       NaN    NaN
2020-08-05    NaN    NaN       NaN       NaN    NaN
2020-08-06    NaN    NaN       NaN       NaN    NaN
2020-08-07    NaN    NaN       NaN       NaN    NaN
2020-08-08    NaN    NaN       NaN       NaN    NaN
2020-08-09    NaN    NaN       NaN       NaN    NaN
2020-08-10    NaN    NaN       NaN       NaN    NaN
2020-08-01    NaN    NaN       NaN       NaN    NaN
2020-08-02    NaN    NaN       NaN       NaN    NaN
2020-08-03    NaN    NaN       NaN       NaN    NaN
2020-08-04    NaN    NaN       NaN       NaN    NaN
2020-08-05    NaN    NaN       NaN       NaN    NaN
2020-08-06    NaN    NaN       NaN       NaN    NaN
2020-08-07    NaN    NaN       NaN       NaN    NaN
2020-08-08    NaN    NaN       NaN       NaN    NaN
2020-08-09    NaN    NaN       NaN       NaN    NaN
2020-08-10    NaN    NaN       NaN       NaN    NaN
2020-08-01    NaN    NaN       NaN       NaN    NaN
2020-08-02    NaN    NaN       NaN       NaN    NaN
2020-08-03    NaN    NaN       NaN       NaN    NaN
2020-08-04    NaN    NaN       NaN       NaN    NaN
2020-08-05    NaN    NaN       NaN       NaN    NaN
2020-08-06    NaN    NaN       NaN       NaN    NaN
2020-08-07    NaN    NaN       NaN       NaN    NaN
2020-08-08    NaN    NaN       NaN       NaN    NaN
2020-08-09    NaN    NaN       NaN       NaN    NaN
2020-08-10    NaN    NaN       NaN       NaN    NaN

Can someone help please?

EDIT: After fixing the DatetimeIndex issue, and using fillna(0):

df = pd.read_csv('data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')

print(len(date_range), date_range)
def reindex_by_date(df):    
    return df.reindex(date_range).ffill().fillna(0)

df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True).reset_index().rename(columns={'index': 'date'})

print(df.to_string())

Output:

               group   rtype location hardware  value
2020-08-01         0       0        0        0    0.0
2020-08-02         0       0        0        0    0.0
2020-08-03         0       0        0        0    0.0
2020-08-04         0       0        0        0    0.0
2020-08-05  my-group  type-s       NY     DTop   10.0
2020-08-06  my-group  type-s       NY     DTop   10.0
2020-08-07  my-group  type-s       NY     DTop   20.0
2020-08-08  my-group  type-s       NY     DTop   20.0
2020-08-09  my-group  type-s       NY     DTop   20.0
2020-08-10  my-group  type-s       NY     DTop   30.0
2020-08-01  my-group  type-s       NY    Tower   40.0
2020-08-02  my-group  type-s       NY    Tower   40.0
2020-08-03  my-group  type-s       NY    Tower   40.0
2020-08-04  my-group  type-s       NY    Tower   40.0
2020-08-05  my-group  type-s       NY    Tower   40.0
2020-08-06  my-group  type-s       NY    Tower   40.0
2020-08-07  my-group  type-s       NY    Tower   50.0
2020-08-08  my-group  type-s       NY    Tower   50.0
2020-08-09  my-group  type-s       NY    Tower   50.0
2020-08-10  my-group  type-s       NY    Tower   50.0
2020-08-01         0       0        0        0    0.0
2020-08-02         0       0        0        0    0.0
2020-08-03         0       0        0        0    0.0
2020-08-04         0       0        0        0    0.0
2020-08-05         0       0        0        0    0.0
2020-08-06         0       0        0        0    0.0
2020-08-07         0       0        0        0    0.0
2020-08-08  ot-group  type-t       NY     LTop   90.0
2020-08-09  ot-group  type-t       NY     LTop   90.0
2020-08-10  ot-group  type-t       NY     LTop   90.0
Wajahat
  • 1,593
  • 3
  • 20
  • 47

2 Answers2

3

You forgot to convert date column as DatetimeIndex:

df = pd.read_csv('data.csv', parse_dates=['date'])
# your code here

Another way with pivot_table:

# Not needed if you used pd.read_csv(..., parse_dates=['date'])
df['date'] = pd.to_datetime(df['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group','rtype','location','hardware']
out = (df.pivot_table(index='date', columns=cols, values='value')
         .reindex(dti.rename('date')).ffill().fillna(0)
         .unstack().rename('value').reset_index())

Output:

>>> out
       group   rtype location hardware       date  value
0   my-group  type-s       NY     DTop 2020-08-01    0.0
1   my-group  type-s       NY     DTop 2020-08-02    0.0
2   my-group  type-s       NY     DTop 2020-08-03    0.0
3   my-group  type-s       NY     DTop 2020-08-04    0.0
4   my-group  type-s       NY     DTop 2020-08-05   10.0
5   my-group  type-s       NY     DTop 2020-08-06   10.0
6   my-group  type-s       NY     DTop 2020-08-07   20.0
7   my-group  type-s       NY     DTop 2020-08-08   20.0
8   my-group  type-s       NY     DTop 2020-08-09   20.0
9   my-group  type-s       NY     DTop 2020-08-10   30.0
10  my-group  type-s       NY    Tower 2020-08-01   40.0
11  my-group  type-s       NY    Tower 2020-08-02   40.0
12  my-group  type-s       NY    Tower 2020-08-03   40.0
13  my-group  type-s       NY    Tower 2020-08-04   40.0
14  my-group  type-s       NY    Tower 2020-08-05   40.0
15  my-group  type-s       NY    Tower 2020-08-06   40.0
16  my-group  type-s       NY    Tower 2020-08-07   50.0
17  my-group  type-s       NY    Tower 2020-08-08   50.0
18  my-group  type-s       NY    Tower 2020-08-09   50.0
19  my-group  type-s       NY    Tower 2020-08-10   50.0
20  ot-group  type-t       NY     LTop 2020-08-01    0.0
21  ot-group  type-t       NY     LTop 2020-08-02    0.0
22  ot-group  type-t       NY     LTop 2020-08-03    0.0
23  ot-group  type-t       NY     LTop 2020-08-04    0.0
24  ot-group  type-t       NY     LTop 2020-08-05    0.0
25  ot-group  type-t       NY     LTop 2020-08-06    0.0
26  ot-group  type-t       NY     LTop 2020-08-07    0.0
27  ot-group  type-t       NY     LTop 2020-08-08   90.0
28  ot-group  type-t       NY     LTop 2020-08-09   90.0
29  ot-group  type-t       NY     LTop 2020-08-10   90.0

However, it could be better to use only pivot_table without unstack:

>>> (df.pivot_table(index='date', columns=cols, values='value')
       .reindex(dti.rename('date')).ffill().fillna(0))

group      my-group       ot-group
rtype        type-s         type-t
location         NY             NY
hardware       DTop Tower     LTop
date                              
2020-08-01      0.0  40.0      0.0
2020-08-02      0.0  40.0      0.0
2020-08-03      0.0  40.0      0.0
2020-08-04      0.0  40.0      0.0
2020-08-05     10.0  40.0      0.0
2020-08-06     10.0  40.0      0.0
2020-08-07     20.0  50.0      0.0
2020-08-08     20.0  50.0     90.0
2020-08-09     20.0  50.0     90.0
2020-08-10     30.0  50.0     90.0
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • I see. But even converting it to `DatetimeIndex`, the output contains NaN for `group, rtype, location, hardware` columns as your output shows (I am getting the same thing). Can we do something to have the same values as for the other dates i.e. what you are able to get using the `pivot_table` ? I guess maybe using `bfill` ? Otherwise, the `pivot_table` method works I guess. – Wajahat May 25 '23 at 05:47
  • I actually meant using bfill with my version of code (without using `pivot_table`) where NaNs exist in all columns. You original `pivot_table` method's output is actually what I need (I can just fillna() with 0s for NaN in `values` column. – Wajahat May 25 '23 at 06:06
  • Yes, that *almost* works. If you see the expected output in my question, I actually want `0`s for value. `bfill()` places the value for the smallest date instead. Just wondering if there is a way to get the output without using `pivot_table`. – Wajahat May 25 '23 at 06:17
  • Why not simply use: `return df.reindex(date_range).ffill().fillna(0)` in your code? – Corralien May 25 '23 at 06:18
  • Because that will fill zeros in text columns as well which I do not want: ``` group rtype location hardware value 2020-08-01 0 0 0 0 0.0 2020-08-02 0 0 0 0 0.0 2020-08-03 0 0 0 0 0.0 2020-08-04 0 0 0 0 0.0 2020-08-05 my-group type-s NY DTop 10.0 2020-08-06 my-group type-s NY DTop 10.0 2020-08-07 my-group type-s NY DTop 20.0 2020-08-08 my-group type-s NY DTop 20.0 ``` – Wajahat May 25 '23 at 06:23
  • Oh I see: `return df.reindex(date_range).ffill().fillna({'value': 0}).bfill()` – Corralien May 25 '23 at 06:25
  • Curious, is there a way to limit `bfill()` to certain columns? – Wajahat May 25 '23 at 06:28
  • 1
    I guess they can be using https://stackoverflow.com/questions/27012151/forward-fill-specific-columns-in-pandas-dataframe. – Wajahat May 25 '23 at 06:39
0

One option is with complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd

df = pd.read_clipboard(parse_dates=['date'])
dti = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
cols = ['group','rtype','location','hardware']

(df
.complete(cols, {'date': dti})
.assign(value = lambda f: f.value.ffill().fillna(0, downcast='infer'))
)

       group   rtype location hardware       date  value
0   my-group  type-s       NY     DTop 2020-08-01      0
1   my-group  type-s       NY     DTop 2020-08-02      0
2   my-group  type-s       NY     DTop 2020-08-03      0
3   my-group  type-s       NY     DTop 2020-08-04      0
4   my-group  type-s       NY     DTop 2020-08-05     10
5   my-group  type-s       NY     DTop 2020-08-06     10
6   my-group  type-s       NY     DTop 2020-08-07     20
7   my-group  type-s       NY     DTop 2020-08-08     20
8   my-group  type-s       NY     DTop 2020-08-09     20
9   my-group  type-s       NY     DTop 2020-08-10     30
10  my-group  type-s       NY    Tower 2020-08-01     40
11  my-group  type-s       NY    Tower 2020-08-02     40
12  my-group  type-s       NY    Tower 2020-08-03     40
13  my-group  type-s       NY    Tower 2020-08-04     40
14  my-group  type-s       NY    Tower 2020-08-05     40
15  my-group  type-s       NY    Tower 2020-08-06     40
16  my-group  type-s       NY    Tower 2020-08-07     50
17  my-group  type-s       NY    Tower 2020-08-08     50
18  my-group  type-s       NY    Tower 2020-08-09     50
19  my-group  type-s       NY    Tower 2020-08-10     50
20  ot-group  type-t       NY     LTop 2020-08-01     50
21  ot-group  type-t       NY     LTop 2020-08-02     50
22  ot-group  type-t       NY     LTop 2020-08-03     50
23  ot-group  type-t       NY     LTop 2020-08-04     50
24  ot-group  type-t       NY     LTop 2020-08-05     50
25  ot-group  type-t       NY     LTop 2020-08-06     50
26  ot-group  type-t       NY     LTop 2020-08-07     50
27  ot-group  type-t       NY     LTop 2020-08-08     90
28  ot-group  type-t       NY     LTop 2020-08-09     90
29  ot-group  type-t       NY     LTop 2020-08-10     90

@corralien's suggestion of a pivot table without an unstack might be a better one, it all depends on your end goal

sammywemmy
  • 27,093
  • 4
  • 17
  • 31