2

My data is STRAVA activities in a dataframe with the index set as the date of the activity. I want to insert rows that are indexed with the date that is missing i.e. my dataframe would be indexed at a frequency of days from oldest to newest in the original data. I have tried the following methods from two other posts here Add missing dates to pandas dataframe and here pandas fill missing dates in time series

However the issue I run into is as follows. Because on some dates two activities occur the index label for the row no longer is unique and is duplicated.

2020-12-20  8.00000  
2020-12-20  7.00000
2020-12-22  6.50000
2020-12-22  9.00000
2020-12-23  4.00000
2020-12-24  3.00000

So when I run df.reindex(index=pd.date_range(df.index.min(),df.index.max(), fill=0, I get the dates inserted but lose the duplicates. Similarly with df.index.asfreq("D"). I get the same issue.

The only solution I have found is using df.align() as parsing in a series with index labels at frequencies of days as below. Then I return the first dataframe in the tuple producing the following result.

dates = pd.date_range(df.index.min(), df.index.max(), freq = "D")
ts = pd.Series(np.random.randint(0,500), len(dates)), index=dates)
tuple = df.align(ts, axis=0)
newdf = tuple[0]

2020-12-20  8.00000  
2020-12-20  7.00000
2020-12-20  NaN
2020-12-21  NaN
2020-12-22  6.50000
2020-12-22  9.00000
2020-12-22  NaN
2020-12-23  4.00000
2020-12-23 NaN
2020-12-24  3.00000
2020-12-24 NaN

Is this the only solution? Is there a way to ignore duplicate indexes but continue to insert rows with dates that are MISSING and therefore not end up with even more duplicate date indexs?

1 Answers1

1

You could make your date series into a dataframe and to a left merge.

import pandas as pd

dates = pd.date_range('2020-12-20', '2020-12-24', freq = "D").to_frame(name='date')

ts = pd.DataFrame({'date': {0: '2020-12-20',
  1: '2020-12-20',
  2: '2020-12-22',
  3: '2020-12-22',
  4: '2020-12-23',
  5: '2020-12-24'},
 'value': {0: 8.0, 1: 7.0, 2: 6.5, 3: 9.0, 4: 4.0, 5: 3.0}})


ts['date'] = pd.to_datetime(ts['date'])


dates.merge(ts, on='date', how='left')

Output

        date  value
0 2020-12-20    8.0
1 2020-12-20    7.0
2 2020-12-21    NaN
3 2020-12-22    6.5
4 2020-12-22    9.0
5 2020-12-23    4.0
6 2020-12-24    3.0
Chris
  • 15,819
  • 3
  • 24
  • 37
  • 1
    Thank you Chris, I have used this method and it works well. For others trying to reproduce the most important thing is to make sure that the "date" columns are the same dtype before merging! otherwise the merge fills other columns with NaN data! The example above shows this explicitly in the ts["date"] = pd.to_datetime(ts["date"]) – hlhorsnellphd Feb 12 '21 at 14:35