I have a very large pandas dataframe and I would like to create a column that contains the time in seconds since the epoch for a ISO-8601 format date string.
I originally used the standard Python libraries for this but the result is quite slow. I have tried to replace this by using the POSIX c library functions strptime
and mktime
directly but have not been able to get the right answer for the time conversion.
Here is the code (to be run in an IPython window)
%load_ext cythonmagic
%%cython
from posix.types cimport time_t
cimport numpy as np
import numpy as np
import time
cdef extern from "sys/time.h" nogil:
struct tm:
int tm_sec
int tm_min
int tm_hour
int tm_mday
int tm_mon
int tm_year
int tm_wday
int tm_yday
int tm_isdst
time_t mktime(tm *timeptr)
char *strptime(const char *s, const char *format, tm *tm)
cdef to_epoch_c(const char *date_text):
cdef tm time_val
strptime(date_text, "%Y-%m-%d", &time_val)
return <unsigned int>mktime(&time_val)
cdef to_epoch_py(const char *date_text):
return np.uint32(time.mktime(time.strptime(date_text, "%Y-%m-%d")))
cpdef np.ndarray[unsigned int] apply_epoch_date_c(np.ndarray col_date):
cdef Py_ssize_t i, n = len(col_date)
cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
for i in range(len(col_date)):
res[i] = to_epoch_c(col_date[i])
return res
cpdef np.ndarray[unsigned int] apply_epoch_date_py(np.ndarray col_date):
cdef Py_ssize_t i, n = len(col_date)
cdef np.ndarray[unsigned int] res = np.empty(n, dtype=np.uint32)
for i in range(len(col_date)):
res[i] = to_epoch_py(col_date[i])
return res
The struct created by strptime
does not look right to me the hours, minutes and seconds values are way too big, removing them or setting them to 0 does not seem to get the answer I'm looking for.
Here is a small test df which shows the values are not right for the c method:
from pandas import DataFrame
test = DataFrame({'date_text':["2015-05-18" for i in range(3)]}, dtype=np.uint32)
apply_epoch_date_py(test['date_text'].values)
Output: array([1431903600, 1431903600, 1431903600], dtype=uint32)
apply_epoch_date_c(test['date_text'].values)
Output: array([4182545380, 4182617380, 4182602980], dtype=uint32)
I don't get why the values for the c version are not always the same and so far off what they should be. I hope the mistake is reasonably small because the time difference between these two on a large dataframe is substantial (I'm not sure how much less work the c version is doing now as it is not working as expected)
test_large = DataFrame({'date_text':["2015-05-18" for i in range(int(10e6))]}, dtype=np.uint32)
%timeit -n 1 -r 1 apply_epoch_date_py(test_large['date_text'].values)
Output: 1 loops, best of 1: 1min 58s per loop
%timeit apply_epoch_date_c(test_large['date_text'].values)
Output: 1 loops, best of 3: 5.59 s per loop
I have looked up this cython time.h post and a general c unix time from string creation post which may be useful to someone answering.
My main question therefore is about the function to_epoch_c
why is this function producing incorrect values? Thanks
Update:
The method from @Jeff is indeed the fastest and simplest approach to solving this problem using pandas.
The performance of the strptime/mktime in Python in poor in comparison to the other methods. The other Python based method mentioned here is much faster. Running the conversion for all the methods mentioned in this post (plus pd.to_datetime
with string format given) provides interesting results. Pandas with infer_datetime_format is easily the fastest, scaling very well. Somewhat unintuitively if you tell pandas what the date format is it is much slower.
Profile comparison of both pandas methods:
%prun -l 3 pd.to_datetime(df['date_text'],infer_datetime_format=True, box=False).values.view('i8')/10**9
352 function calls (350 primitive calls) in 0.021 seconds
Ordered by: internal time
List reduced from 96 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.013 0.013 0.013 0.013 {pandas.tslib.array_to_datetime}
1 0.005 0.005 0.005 0.005 {pandas.lib.isnullobj}
1 0.001 0.001 0.021 0.021 <string>:1(<module>)
%prun -l 3 pd.to_datetime(df['date_text'],format="%Y-%m-%d", box=False).values.view('i8')/10**9
109 function calls (107 primitive calls) in 0.253 seconds
Ordered by: internal time
List reduced from 55 to 3 due to restriction <3>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.251 0.251 0.251 0.251 {pandas.tslib.array_strptime}
1 0.001 0.001 0.253 0.253 <string>:1(<module>)
1 0.000 0.000 0.252 0.252 tools.py:176(to_datetime)