Fill missing dates pyspark

Question

I have a pandas data frame like this

date,userId,classification
2018-03-29,55,Large 
2018-03-30,55, small
2018-03-29,55, x-small
2018-04-20,65, Large 
2018-04-29,75, x-small

How do I fill the missing dates but per userId for a time period of 60 days? I tried it with pandas using indexing the date and then re-indexing and filling it but it gave all null values all other fields. I am ok with any solution using spark dataframes or pandas using python or java.

The code I tried

import pandas as pd

idx = pd.date_range('02-28-2018', '04-29-2018')

df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
 ['Chandler Bing','55','2018-03-29',60],
 ['Chandler Bing','55','2018-03-30',59],
 ['Harry Kane','45','2018-04-30',80],
 ['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])

df['timestamp'] = pd.to_datetime(df['timestamp']) 
df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print(df)

and the error I get is 'ValueError: cannot reindex from a duplicate axis'

Even this version didn't work

    import pandas as pd

idx = pd.date_range('02-28-2018', '04-29-2018')

df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
 ['Chandler Bing','55','2018-03-29',60],
 ['Chandler Bing','55','2018-03-30',59],
 ['Harry Kane','45','2018-04-30',80],
 ['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])

df['timestamp'] = pd.to_datetime(df['timestamp']) 
pd.DatetimeIndex(df['timestamp'])
del(df['timestamp'])
#df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print (df)
uniquaccount=df['accountid'].unique()
print(uniquaccount)

You could generate a new data frame with 60 days for each user id, and then join that new data frame with your current spark data frame. — ArunK, Apr 24 '18 at 19:46

score 0 · Answer 1 · edited Apr 24 '18 at 20:26

0

You can use reindex with Pandas series

import pandas as pd

idx = pd.date_range('02-28-2018', '04-29-2018')

s = pd.Series({'2018-03-29' : 55,
                '2018-03-30' : 55,
                '2018-03-29' : 55,
                '2018-04-20' : 65,
                '2018-04-29' :75})

s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

will impute all the missing dates:

2018-03-28     0
2018-03-29    55
2018-03-30    55
2018-03-31     0
2018-04-01     0
2018-04-02     0
2018-04-03     0
2018-04-04     0
...

edited Apr 24 '18 at 20:26

zero323

322,348
103
959
935

answered Apr 24 '18 at 19:58

dportman

1,101
10
20

I don't think that works, it says cannot re-index from a duplicate axis, I am attaching the source to the original post. – Masterbuilder Apr 24 '18 at 21:22

score 0 · Answer 2 · answered Apr 10 '19 at 22:41

Reindexing doesn't work so well with a non-unique index. Instead, create an intermediate dataframe with one row per timestamp/account combination, then merge:

import pandas as pd

idx = pd.date_range('02-28-2018', '04-29-2018')

df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
 ['Chandler Bing','55','2018-03-29',60],
 ['Chandler Bing','55','2018-03-30',59],
 ['Harry Kane','45','2018-04-30',80],
 ['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])

df['timestamp'] = pd.to_datetime(df['timestamp']) 

# Step 1: create an intermediate dataframe with the cartesian product (CROSS JOIN)
#   of all of the timestamps and IDs
idx = pd.Series(idx, name='timestamp').to_frame()
unique_accounts = df[['accountid', 'name']].drop_duplicates()
# Pandas CROSS JOIN, see https://stackoverflow.com/questions/53699012/performant-cartesian-product-cross-join-with-pandas/53699013#53699013
df_intermediate = pd.merge(unique_accounts.assign(dummy=1), idx.assign(dummy=1), on='dummy', how='inner')
df_intermediate = df_intermediate.drop(columns='dummy')

# Step 2: merge with the original dataframe, and fill missing values
df_new = df_intermediate.merge(df.drop(columns='name'), how='left', on=['accountid', 'timestamp'])
df_new['size'] = df_new['size'].fillna(value=0)

Also, consider using a different variable name than "size". size is a reserved name in pandas.

Fill missing dates pyspark

2 Answers2