I have a DataFrame with about 20 million rows and a DatetimeIndex. The data span several years, and I would like to map them all onto the same year by changing the timestamps. The statements below accomplish this, but they are slower than I would like (double-digit seconds).

new_index = [ts.replace(year=2014) for ts in df.index]
df.index = new_index

The data are not evenly spaced, so I cannot generate a new index easily. Is there a better way?

adr

3 Answers

Try building the new index from its date components with pd.to_datetime:

%%time
new_index = pd.to_datetime({
    'year': 2014,
    'month': df.index.month,
    'day': df.index.day})

CPU times: user 333 ms, sys: 34.4 ms, total: 367 ms
Wall time: 346 ms

Compared to the original:

%%time
new_index = [ts.replace(year=2014) for ts in df.index]

CPU times: user 6.97 s, sys: 115 ms, total: 7.08 s
Wall time: 7.1 s

These timings are for a dataset of 1M rows, but I would expect a similar speedup for 20M.

Note that the call above keeps only the date. If hours, minutes and seconds are to be preserved, they should be passed to to_datetime as well, under the keys 'hour', 'minute' and 'second'.
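
A minimal sketch of the time-preserving variant, assuming the index is a DatetimeIndex with second-level resolution. The helper name set_year_keep_time is hypothetical and only for illustration. Sub-second parts are dropped here, and as far as I know a Feb 29 timestamp mapped onto a non-leap year raises an error instead of being adjusted.

```python
import pandas as pd


def set_year_keep_time(index: pd.DatetimeIndex, year: int) -> pd.DatetimeIndex:
    """Return a copy of `index` with every timestamp moved to `year`.

    Hypothetical helper: keeps month, day, hour, minute and second,
    and discards anything below one second.
    """
    parts = {
        "year": year,            # scalar, broadcast to every row
        "month": index.month,
        "day": index.day,
        "hour": index.hour,
        "minute": index.minute,
        "second": index.second,
    }
    # to_datetime assembles timestamps from the component columns and
    # returns a Series; wrap it so the result can be used as an index.
    return pd.DatetimeIndex(pd.to_datetime(parts))


# Small, unevenly spaced example index spanning two years.
example = pd.DatetimeIndex(["2012-03-04 09:08:07", "2013-12-31 23:59:59"])
result = set_year_keep_time(example, 2014)
```

With a real DataFrame this would be used as df.index = set_year_keep_time(df.index, 2014).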

perl

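Please try the following:

import pandas as pd
# Format each timestamp with the year fixed to 2016, then parse the strings
# back so the index stays a DatetimeIndex rather than a list of strings.
df.index = pd.to_datetime(df.index.strftime('2016-%m-%d %H:%M:%S'))

In the example above, the year is changed to 2016, and the output is as follows: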

df

                       open    high     low   close
2016-01-02 09:08:00  116.00  116.00  116.00  116.00
2016-01-02 09:16:00  116.10  117.80  117.00  113.00
2016-01-03 09:07:00  115.50  116.20  115.50  116.20
2016-01-02 09:19:00  116.00  116.00  115.60  115.75
2016-01-02 09:18:00  116.05  116.35  116.00  116.00

Dhiraj Bansal

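You can try:

# Make sure the index is a DatetimeIndex before applying the offset.
df.index = pd.DatetimeIndex(df.index)
# year= (singular) replaces the year; years= (plural) would add years instead.
df.index = df.index + pd.DateOffset(year=2016)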
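
A small sketch of the difference between the singular and plural DateOffset keywords, using a made-up two-element index. In the pandas versions I am aware of, the singular form may be applied element by element and emit a PerformanceWarning, so it is worth timing on the full 20M rows before relying on it; the warning is silenced below only to keep the example quiet.

```python
import warnings

import pandas as pd

# Made-up sample timestamps from two different years.
idx = pd.DatetimeIndex(["2012-03-04 09:08:07", "2013-11-30 18:00:00"])

with warnings.catch_warnings():
    # Some pandas versions warn that this offset is not applied vectorized.
    warnings.simplefilter("ignore")
    # Singular keyword: set the year of every timestamp to 2016.
    replaced = idx + pd.DateOffset(year=2016)
    # Plural keyword: move every timestamp forward by two years.
    shifted = idx + pd.DateOffset(years=2)
```

Only the singular form puts all timestamps into the same year; the plural form keeps them in different years.
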
Rajesh