I would like to fill the missing values of the dates of a Pandas dataframe, but instead of filling the missing date based only on the date column, I would like to do it based on more than 1 column. In this case, the column source
.
The example is the following
Original
date_found source count_unique_uuids count_unique_uuids_raw
2021-05-13 source_1 20 20
2021-05-14 source_2 1829 1829
2021-05-14 source_3 2245 2245
2021-05-14 source_1 40 40
2021-05-15 source_1 903 903
2021-05-16 source_2 20 20
2021-05-18 source_3 89 89
Desired dataset
date_found source count_unique_uuids count_unique_uuids_raw
2021-05-13 source_1 20 20
2021-05-13 source_2 0 0
2021-05-13 source_3 0 0
2021-05-14 source_1 40 40
2021-05-14 source_2 1829 1829
2021-05-14 source_3 2245 2245
2021-05-15 source_1 903 903
2021-05-15 source_2 0 0
2021-05-15 source_3 0 0
2021-05-16 source_1 0 0
2021-05-16 source_2 20 20
2021-05-16 source_3 0 0
2021-05-17 source_1 0 0
2021-05-17 source_2 0 0
2021-05-17 source_3 0 0
2021-05-18 source_1 0 0
2021-05-18 source_2 0 0
2021-05-18 source_3 89 89
I was using reindex and resample as a reference to build the dataset
Reindex
: Add missing dates to pandas dataframe
Resample
: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
I'm trying the following code
But getting an error
ValueError: StringArray requires a sequence of strings or pandas.NA
def add_missing_dates(df: pd.DataFrame) -> pd.DataFrame:
df['date_found'] = pd.to_datetime(df['date_found'], format='%Y-%m-%d')
min_date = df['date_found'].min()
max_date = df['date_found'].max()
(df.set_index(['date_found'])
.groupby(['source'], as_index=False, group_keys=False)
.apply(lambda x: x.reindex(pd.date_range(min_date, max_date)))
.reset_index().rename(columns={'index': 'date_found'})
.fillna(0)
)
return df
def add_dates_to_source(df: pd.DataFrame, source: str = 'source') -> pd.DataFrame:
sources = df[source].tolist()
dfs_to_concat = []
for source_value in sources:
filtered_df = df.loc[df[source] == source_value]
df_ = add_missing_dates(filtered_df)
df_[source] = source_value
dfs_to_concat.append(df_)
return pd.concat(dfs_to_concat)
To run
df = add_dates_to_source(df)