I have a dataframe with multiple string columns, one date column and one int value column.
I want to ffill
the missing dates for each group of text columns. The missing dates are all dates from the min date to max date in the dataframe. I think this is better explained using an example.
Sample Input:
group rtype location hardware date value
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-07 50
ot-group type-t NY LTop 2020-08-08 90
Min and Max date for this dataframe: (start_date) 2020-08-01 - (end_date) 2020-08-10
Sample Output:
group rtype location hardware date value
my-group type-s NY DTop 2020-08-01 0
my-group type-s NY DTop 2020-08-02 0
my-group type-s NY DTop 2020-08-03 0
my-group type-s NY DTop 2020-08-04 0
my-group type-s NY DTop 2020-08-05 10
my-group type-s NY DTop 2020-08-06 10
my-group type-s NY DTop 2020-08-07 20
my-group type-s NY DTop 2020-08-08 20
my-group type-s NY DTop 2020-08-09 20
my-group type-s NY DTop 2020-08-10 30
my-group type-s NY Tower 2020-08-01 40
my-group type-s NY Tower 2020-08-02 40
my-group type-s NY Tower 2020-08-03 40
my-group type-s NY Tower 2020-08-04 40
my-group type-s NY Tower 2020-08-05 40
my-group type-s NY Tower 2020-08-06 40
my-group type-s NY Tower 2020-08-07 50
my-group type-s NY Tower 2020-08-08 50
my-group type-s NY Tower 2020-08-09 50
my-group type-s NY Tower 2020-08-10 50
ot-group type-t NY LTop 2020-08-01 0
ot-group type-t NY LTop 2020-08-02 0
ot-group type-t NY LTop 2020-08-03 0
ot-group type-t NY LTop 2020-08-04 0
ot-group type-t NY LTop 2020-08-05 0
ot-group type-t NY LTop 2020-08-06 0
ot-group type-t NY LTop 2020-08-07 0
ot-group type-t NY LTop 2020-08-08 90
ot-group type-t NY LTop 2020-08-09 90
ot-group type-t NY LTop 2020-08-10 90
In this example, I kept the location fixed to avoid an extra long output.
I am able to get the dates I want using pd.date_range()
.
I tried using resample
with multiindex but I run into errors (similar to this).
I tried the approach mentioned in this answer but it doesn't seem to work:
My code using:
import pandas as pd
df = pd.read_csv('data.csv')
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill()
df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True)
print(df.to_string())
Output of this code:
10 DatetimeIndex(['2020-08-01', '2020-08-02', '2020-08-03', '2020-08-04',
'2020-08-05', '2020-08-06', '2020-08-07', '2020-08-08',
'2020-08-09', '2020-08-10'],
dtype='datetime64[ns]', freq='D')
group rtype location hardware value
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
2020-08-01 NaN NaN NaN NaN NaN
2020-08-02 NaN NaN NaN NaN NaN
2020-08-03 NaN NaN NaN NaN NaN
2020-08-04 NaN NaN NaN NaN NaN
2020-08-05 NaN NaN NaN NaN NaN
2020-08-06 NaN NaN NaN NaN NaN
2020-08-07 NaN NaN NaN NaN NaN
2020-08-08 NaN NaN NaN NaN NaN
2020-08-09 NaN NaN NaN NaN NaN
2020-08-10 NaN NaN NaN NaN NaN
Can someone help please?
EDIT:
After fixing the DatetimeIndex issue, and using fillna(0)
:
df = pd.read_csv('data.csv', parse_dates=['date'])
df.set_index('date', inplace=True)
date_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
print(len(date_range), date_range)
def reindex_by_date(df):
return df.reindex(date_range).ffill().fillna(0)
df = df.groupby(['group','rtype','location','hardware']).apply(reindex_by_date).reset_index([0,1,2,3], drop=True).reset_index().rename(columns={'index': 'date'})
print(df.to_string())
Output:
group rtype location hardware value
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 my-group type-s NY DTop 10.0
2020-08-06 my-group type-s NY DTop 10.0
2020-08-07 my-group type-s NY DTop 20.0
2020-08-08 my-group type-s NY DTop 20.0
2020-08-09 my-group type-s NY DTop 20.0
2020-08-10 my-group type-s NY DTop 30.0
2020-08-01 my-group type-s NY Tower 40.0
2020-08-02 my-group type-s NY Tower 40.0
2020-08-03 my-group type-s NY Tower 40.0
2020-08-04 my-group type-s NY Tower 40.0
2020-08-05 my-group type-s NY Tower 40.0
2020-08-06 my-group type-s NY Tower 40.0
2020-08-07 my-group type-s NY Tower 50.0
2020-08-08 my-group type-s NY Tower 50.0
2020-08-09 my-group type-s NY Tower 50.0
2020-08-10 my-group type-s NY Tower 50.0
2020-08-01 0 0 0 0 0.0
2020-08-02 0 0 0 0 0.0
2020-08-03 0 0 0 0 0.0
2020-08-04 0 0 0 0 0.0
2020-08-05 0 0 0 0 0.0
2020-08-06 0 0 0 0 0.0
2020-08-07 0 0 0 0 0.0
2020-08-08 ot-group type-t NY LTop 90.0
2020-08-09 ot-group type-t NY LTop 90.0
2020-08-10 ot-group type-t NY LTop 90.0