I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. Here's an example of the data - note that there are four 2016-08-23 00:00:17.664193 timestamps:
In [167]: ts
Out[167]:
                               last  last_sz      bid      ask
datetime
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00
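For reference, the duplicated index entries can be counted along these lines (a minimal sketch, assuming ts is the DataFrame with the DatetimeIndex shown above):

# Rows whose timestamp already appeared earlier in the index
# (keep='first' leaves the first occurrence of each timestamp unmarked).
dup_mask = ts.index.duplicated(keep='first')
print(dup_mask.sum())                    # 4 duplicated rows in the sample above
print(ts.index.value_counts().head())    # timestamps with the most repeats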
In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows, all sharing the same timestamp. I'm aiming to solve this by adding 1 extra nanosecond to each duplicate (so in a run of four identical timestamps, I'd add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be converted to:
In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
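Spelled out, the rule is: within each run of identical timestamps, the n-th occurrence (counting from zero) is shifted by n nanoseconds, so the offsets for the sample above would be 0, 0, 0, 0, 1, 2, 3, 0, 0, 1 ns. A sketch of the kind of per-row offset I have in mind (it assumes identical timestamps only ever appear consecutively, so that grouping by index value matches grouping by run):

import pandas as pd

# Zero-based position of each row within its group of identical timestamps:
# 0 for the first occurrence, 1 for the second, and so on.
offsets = ts.groupby(level=0).cumcount()

# Shift each timestamp by that many nanoseconds.
unique_index = ts.index + pd.to_timedelta(offsets.values, unit='ns')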
I've struggled to find a good way to do this - my current solution is to make multiple passes, checking for duplicates each time, and adding 1ns to all but the first in a series of identical timestamps. Here's the code:
import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0
    while duplicate_count > 0:
        # Copy the index into a column, add 1ns to every row flagged as a
        # duplicate, then rebuild the index and re-check for duplicates.
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1
    print('%d passes of duplication loop' % passes)
    return ts
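For what it's worth, typical usage looks like this (assuming ts is the DataFrame from above):

ts = make_timestamps_unique(ts)
assert ts.index.is_unique  # holds once the loop has finished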
This is obviously quite inefficient - it often requires hundreds of passes, and if I try it on a 2 million row DataFrame I get a MemoryError. Any ideas for a better way to achieve this?