
I'm working with financial data, which is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. This is an example of the data - note there are four 2016-08-23 00:00:17.664193 timestamps:

In [167]: ts
Out[167]: 
                               last  last_sz      bid      ask
datetime                                                      
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00

In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows, all sharing the same timestamp. I'm aiming to solve this by adding 1 extra nanosecond to each duplicate (so in the case of 4 consecutive identical timestamps, I'd add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be converted to:

In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex                                                        
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0

I've struggled to find a good way to do this - my current solution is to make multiple passes, checking for duplicates each time, and adding 1ns to all but the first in a series of identical timestamps. Here's the code:

def make_timestamps_unique(ts):
    # flag every row whose timestamp has already appeared (keeping the first occurrence)
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0

    while duplicate_count > 0:
        # push every flagged row forward by 1ns, rebuild the index and re-check;
        # a run of n identical timestamps therefore needs n - 1 passes
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1

    print('%d passes of duplication loop' % passes)
    return ts

This is obviously quite inefficient - it often requires hundreds of passes, and if I try it on a 2 million row dataframe, I get a MemoryError. Any ideas for a better way to achieve this?
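
For reference (this sketch is an editorial addition, not part of the question or the answers below): the rule described above, add k nanoseconds to the k-th occurrence of a timestamp, can be computed in a single vectorized pass with groupby/cumcount. The function name add_ns_offsets is made up for illustration, and the sketch assumes the frame is already sorted by time:

import pandas as pd

def add_ns_offsets(ts):
    # position of each row within its group of identical timestamps: 0, 1, 2, ...
    offsets = ts.groupby(level=0).cumcount()
    # shift the k-th duplicate forward by k nanoseconds
    ts.index = ts.index + pd.to_timedelta(offsets.values, unit='ns')
    ts.index.name = 'newindex'
    return ts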

3 Answers


Here is a faster numpy version (but a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated timestamp flags while resetting the cumulative sum each time a np.NaN is encountered:

# get duplicated values as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN

# restart the running count at every NaN, i.e. at every non-duplicated row
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# add the cumulative offsets (interpreted as nanoseconds) to the original index
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)

DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)

Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the (more readable) groupby/apply approach from @DYZ is roughly 50 times slower on the dummy data, at 100 loops, best of 3: 5.3 ms per loop.

Of course, you finally have to assign the result back to your index:

df.index = result
pansen
    Just one note - I changed `keep=False` to `keep='first'` (in the first line) in order to get the same result as my example and the @DYZ solution – strongvigilance Apr 09 '17 at 09:40
  • @strongvigilance You're correct - this modification yields your desired result. – pansen Apr 09 '17 at 16:19
  • Also note that your dataframe needs to be sorted by index before applying this - may be obvious, but took me a while to figure out what was wrong. – Melissa Sep 26 '17 at 19:53
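
Pulling the comments together, here is a minimal sketch (an editorial addition, not part of the original answer) that wraps the same numpy trick in a function, uses keep='first' so the first occurrence is left untouched, and sorts the index up front as the last comment suggests. The name make_index_unique is made up for illustration:

import numpy as np
import pandas as pd

def make_index_unique(df):
    # sort first: the cumulative-offset trick assumes identical timestamps are adjacent
    df = df.sort_index()

    # keep='first' flags only the 2nd, 3rd, ... occurrence of each timestamp
    values = df.index.duplicated(keep='first').astype(float)
    values[values == 0] = np.nan

    # reset the running count at the start of every new timestamp
    missings = np.isnan(values)
    cumsum = np.cumsum(~missings)
    diff = np.diff(np.concatenate(([0.], cumsum[missings])))
    values[missings] = -diff

    # np.cumsum(values) is now 0, 1, 2, ... within each run of duplicates;
    # convert explicitly to nanosecond timedeltas and add to the index
    df.index = df.index + pd.to_timedelta(np.cumsum(values), unit='ns')
    return df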

You can group the rows by the index and then add a range of sequential timedeltas to the index of each group. I am not sure if this can be done directly with the index, but you can first convert the index to an ordinary column, apply the operation to the column, and set the column as the index again:

# work on the index as an ordinary column: within each group of identical
# timestamps, add 0ns, 1ns, 2ns, ..., then assign the result back as the index
newindex = ts.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
ts.index = newindex
DYZ
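
As a quick sanity check (an editorial addition, not part of the answer): after assigning the new index, it should be unique and, for input that was already sorted, still in time order.

# every timestamp should now be distinct, and the tiny offsets should keep
# the index in time order for data that was already sorted
assert ts.index.is_unique
assert ts.index.is_monotonic_increasing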

Let's start with a vectorized approach; since you are dealing with 1M+ rows, this should be a priority.

Let's make some test data since none was provided:

rng = pd.date_range('1/1/2011', periods=72, freq='H')

df = pd.DataFrame(dict(time = rng))

Duplicate the timestamps:

df = pd.concat((df, df))
df = df.sort_values('time')  # older pandas: df.sort()

df
Out [296]:
                  time
0  2011-01-01 00:00:00
0  2011-01-01 00:00:00
1  2011-01-01 01:00:00
1  2011-01-01 01:00:00
2  2011-01-01 02:00:00
2  2011-01-01 02:00:00
3  2011-01-01 03:00:00
3  2011-01-01 03:00:00
4  2011-01-01 04:00:00
4  2011-01-01 04:00:00
5  2011-01-01 05:00:00
5  2011-01-01 05:00:00
6  2011-01-01 06:00:00
6  2011-01-01 06:00:00
7  2011-01-01 07:00:00
7  2011-01-01 07:00:00
8  2011-01-01 08:00:00
8  2011-01-01 08:00:00
9  2011-01-01 09:00:00
9  2011-01-01 09:00:00

Find the locations where the difference in time from the previous row is 0 seconds

mask = (df.time-df.time.shift()) == np.timedelta64(0,'s')

mask
Out [307]:
0     False
0      True
1     False
1      True
2     False
2      True
3     False
3      True
4     False
4      True
5     False

Offset these locations; in this case I chose milliseconds:

df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5))

Out [309]:
                      time
0  2011-01-01 00:00:00.000
0  2011-01-01 00:00:00.005
1  2011-01-01 01:00:00.000
1  2011-01-01 01:00:00.005
2  2011-01-01 02:00:00.000
2  2011-01-01 02:00:00.005
3  2011-01-01 03:00:00.000
3  2011-01-01 03:00:00.005
4  2011-01-01 04:00:00.000
4  2011-01-01 04:00:00.005
5  2011-01-01 05:00:00.000

EDIT: With consecutive timestamps (this assumes runs of 4):

consect = 4
for i in range(4):
    # flag rows whose timestamp equals the one `consect` rows earlier,
    # then push them forward by (5 + i) milliseconds
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1
SerialDev
  • Surely this won't work in situations such as my example where there are more than 2 consecutive identical timestamps? – strongvigilance Apr 08 '17 at 17:47
  • Iterate, modifying the shift so that it works in batches of all the similar ones it has seen; just tested with 4 consecutive ones, it should still work fine – SerialDev Apr 08 '17 at 17:52
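
For runs of arbitrary and unknown length, the position of each row within its run can be computed directly instead of iterating over shifts. This is the same cumcount idea sketched under the question, applied here to the time column with this answer's 5 ms step (an editorial sketch, not part of the answer):

import pandas as pd

# number each row within its group of identical timestamps (0, 1, 2, ...) and
# push the k-th duplicate forward by k * 5 milliseconds, whatever the run length
dup_rank = df.groupby('time').cumcount()
df['time'] = df['time'] + pd.to_timedelta(dup_rank * 5, unit='ms')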