
I have a DataFrame with hundreds of millions of rows.

     latitude                    longitude                   time                    VAL     
0   -39.20000076293945312500    140.80000305175781250000    1972-01-19 13:00:00     1.20000004768371582031  
1   -39.20000076293945312500    140.80000305175781250000    1972-01-20 13:00:00     0.89999997615814208984 
2   -39.20000076293945312500    140.80000305175781250000    1972-01-21 13:00:00     1.50000000000000000000 
3   -39.20000076293945312500    140.80000305175781250000    1972-01-22 13:00:00     1.60000002384185791016 
4   -39.20000076293945312500    140.80000305175781250000    1972-01-23 13:00:00     1.20000004768371582031
... ...

It contains a time column of type datetime64 in UTC. The following code creates a new column isInDST indicating whether the time falls within the daylight saving period of a local time zone.

df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

It takes about 400 seconds to process 15,223,160 rows.

Is there a better-performing approach? Is np.vectorize a better way?

    Can you try using `df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]` – Amit Vikram Singh Apr 08 '21 at 00:57
  • Thanks @AmitVikramSingh. Your code took 810 seconds to process 103,621,920 rows. I have 148*244*424680 = 15,336,044,160 records. Given this is linear processing, it would take more than 30 hours to process the whole data set. – alextc Apr 08 '21 at 02:11
  • https://stackoverflow.com/questions/52467759/flag-daylight-saving-time-dst-hours-in-pandas-date-time-column – DYZ Apr 08 '21 at 05:02
  • maybe something using `utcoffset()` could be slightly faster? – filippo Apr 08 '21 at 08:35
  • How about comparing `df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').map(pd.Timestamp.dst) > datetime.timedelta()` ? – Oluwafemi Sule Apr 09 '21 at 05:20

1 Answer


All results are calculated on 1M datapoints.

Cython + np.vectorize

7.2 times faster than the original code
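
Note: the %%cython cell magic assumes a Jupyter/IPython session with Cython installed; the extension needs to be loaded first:

%load_ext Cython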

%%cython
from cpython.datetime cimport datetime
cpdef bint c_is_in_dst(datetime dt):
    return dt.dst().total_seconds() != 0 

%%timeit
df['isInDST'] = np.vectorize(c_is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.08 s ± 10.2 ms per loop

np.vectorize

6.5 times faster than the original code

def is_in_dst(dt):
    return dt.dst().total_seconds() != 0 

%%timeit
df['isInDST'] = np.vectorize(is_in_dst)(df['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria').dt.to_pydatetime())

1.2 s ± 29.3 ms per loop

Based on the documentation ("The implementation is essentially a for loop"), I expected this to perform the same as the list comprehension, but it is consistently a little faster.
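
As a rough sketch (not NumPy's actual implementation), np.vectorize behaves much like a Python-level loop that applies the function element-wise and collects the results into an array:

def vectorize_sketch(func, values):
    # Python-level loop, roughly what np.vectorize does for this use case
    return np.array([func(v) for v in values])

# vectorize_sketch(is_in_dst, local_times) gives the same result as
# np.vectorize(is_in_dst)(local_times)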

List comprehension

5.9 times faster than the original code

%%timeit
df['isInDST'] = [x.dst().total_seconds()!=0 for x in pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria')]

1.33 s ± 48.4 ms per loop

This result shows that pandas map/apply is very slow; it adds overhead that can be eliminated by simply using a plain Python loop.

Original approach (map on pandas DatetimeIndex)

%%timeit
df['isInDST'] = pd.DatetimeIndex(df['time']).tz_localize('UTC').tz_convert('Australia/Victoria').map(lambda x : x.dst().total_seconds()!=0)

7.82 s ± 84.3 ms per loop

Tested on 1M rows of dummy data

import datetime, random
import numpy as np, pandas as pd

N = 1_000_000
df = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23), minute=random.randint(0, 59)) for _ in range(N)]})

I also ran the code on 100K and 10M rows - the results scale linearly with the number of rows.
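
For reference, a minimal sketch of such a scaling check (illustrative, not one of the timings above), reusing the is_in_dst function and the dummy-data recipe:

import time

for n in (100_000, 1_000_000, 10_000_000):
    # build a naive-UTC dummy sample of n rows, as in the setup above
    sample = pd.DataFrame({"time": [datetime.datetime.now().replace(hour=random.randint(0, 23), minute=random.randint(0, 59)) for _ in range(n)]})
    local = sample['time'].dt.tz_localize('UTC').dt.tz_convert('Australia/Victoria')
    start = time.perf_counter()
    sample['isInDST'] = np.vectorize(is_in_dst)(local.dt.to_pydatetime())
    print(n, "rows:", round(time.perf_counter() - start, 2), "s")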

PermanentPon