I have a pandas data frame like this
date,userId,classification 2018-03-29,55,Large 2018-03-30,55, small 2018-03-29,55, x-small 2018-04-20,65, Large 2018-04-29,75, x-small
How do I fill the missing dates but per userId for a time period of 60 days? I tried it with pandas using indexing the date and then re-indexing and filling it but it gave all null values all other fields. I am ok with any solution using spark dataframes or pandas using python or java.
The code I tried
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
['Chandler Bing','55','2018-03-29',60],
['Chandler Bing','55','2018-03-30',59],
['Harry Kane','45','2018-04-30',80],
['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print(df)
and the error I get is 'ValueError: cannot reindex from a duplicate axis'
Even this version didn't work
import pandas as pd
idx = pd.date_range('02-28-2018', '04-29-2018')
df = pd.DataFrame([['Chandler Bing','55','2018-03-29',51],
['Chandler Bing','55','2018-03-29',60],
['Chandler Bing','55','2018-03-30',59],
['Harry Kane','45','2018-04-30',80],
['Harry Kane','45','2018-04-21',90]],columns=['name','accountid','timestamp','size'])
df['timestamp'] = pd.to_datetime(df['timestamp'])
pd.DatetimeIndex(df['timestamp'])
del(df['timestamp'])
#df.set_index('timestamp', inplace=True)
print (df)
df= df.reindex(idx, fill_value=0)
print (df)
uniquaccount=df['accountid'].unique()
print(uniquaccount)