I am trying to get the Unix timestamp from millions of bytes objects.

Using this code:

import datetime 
dt_bytes = b'2019-05-23 09:37:56.362965'
#fmt = '%m/%d/%Y %H:%M:%S.%f'
fmt = '%Y-%m-%d %H:%M:%S.%f'
dt_ts = datetime.datetime.strptime(dt_bytes.decode('utf-8'), fmt)
unix_ts = dt_ts.timestamp()

works perfectly:

In [82]: unix_ts
Out[82]: 1558604276.362965

But the decode('utf-8') call is cutting the throughput roughly in half (from 38k/sec to 20k/sec).

So is there a way to get the Unix timestamp from a bytes input instead of a str input?

__UPDATE:__

I found out that the bottleneck is datetime.datetime.strptime(..), not the decode, so I switched to np.datetime64 (see my answer below).

__UPDATE 2:__ Check the accepted answer below to get a good performance benchmark of different approaches.

gies0r
  • Try `decode('ascii')` and see if that's faster. If the input only contains dates and times, it shouldn't need the full Unicode character set. – Mark Ransom May 06 '20 at 19:38
  • @MarkRansom Sadly this is not the case - The runtimes are around equal (19.850 per second). – gies0r May 06 '20 at 19:51
  • Ok I will delete the question, because `dt_bytes.decode('ascii')` is demanding almost no runtime. The `datetime.datetime.strptime(..)` method **slows** it down - not the `decode` – gies0r May 06 '20 at 19:55
  • I've been wracking my brain for a faster way to convert to string and I can't think of one. The best I could come up with is `''.join(chr(b) for b in dt_bytes)` but I doubt that will be faster. – Mark Ransom May 06 '20 at 19:56
  • You could try `dateutil` to parse the date, it might be faster. – Mark Ransom May 06 '20 at 19:56
  • Check out [A faster strptime?](https://stackoverflow.com/questions/13468126/a-faster-strptime) - explicit conversion of parts of the string to int might look ugly, but if you have a fixed string format coming in, this is much faster than `strptime`. Won't get around decoding though. – FObersteiner May 06 '20 at 20:10
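
The explicit-conversion idea from the last comment can be taken one step further: in Python 3, int() accepts bytes containing ASCII digits, so a fixed-width timestamp can be sliced and converted without any decode. A minimal sketch, assuming every input has exactly the layout b'YYYY-MM-DD HH:MM:SS.ffffff' and should be interpreted as UTC (the helper name bytes_to_unix_ts is made up for illustration):

```python
from datetime import datetime, timezone

def bytes_to_unix_ts(b):
    """Convert b'YYYY-MM-DD HH:MM:SS.ffffff' to a POSIX timestamp (float).

    Assumes a fixed-width layout with a 6-digit fraction and UTC input.
    """
    return datetime(
        int(b[0:4]), int(b[5:7]), int(b[8:10]),       # year, month, day
        int(b[11:13]), int(b[14:16]), int(b[17:19]),  # hour, minute, second
        int(b[20:26]),                                # microseconds
        tzinfo=timezone.utc,
    ).timestamp()

unix_ts = bytes_to_unix_ts(b'2019-05-23 09:37:56.362965')
print(unix_ts)
```

Drop tzinfo=timezone.utc if the values are local time, as in the question's code. Whether this beats strptime on a given machine is something to measure with %timeit.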

2 Answers

Let's first assume you have strings in ISO format with a space as the date/time separator, '%Y-%m-%d %H:%M:%S.%f', in a list (let's also not consider decoding from a byte array for now):

from datetime import datetime, timedelta
base, n = datetime(2000, 1, 1, 1, 2, 3, 420001), 1000
datelist = [(base + timedelta(days=i)).isoformat(' ') for i in range(n)]
# datelist
# ['2000-01-01 01:02:03.420001'
# ...
# '2002-09-26 01:02:03.420001']

from string to datetime object

Let's define some functions that parse string to datetime, using different methods:

import re
import numpy as np

def strp_isostr(l):
    return list(map(datetime.fromisoformat, l))

def isostr_to_nparr(l):
    return np.array(l, dtype=np.datetime64)

def split_isostr(l):
    def splitter(s):
        tmp = s.split(' ')
        tmp = tmp[0].split('-') + [tmp[1]]
        tmp = tmp[:3] + tmp[3].split(':')
        tmp = tmp[:5] + tmp[5].split('.')
        return datetime(*map(int, tmp))
    return list(map(splitter, l))

def resplit_isostr(l):
    # split on ' ', '-', ':' and '.'; a raw string avoids invalid escape sequences
    return [datetime(*map(int, re.split(r'[ \-:.]', s))) for s in l]

def full_stptime(l):
    # return list(map(lambda s: datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f'), l))
    return [datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f') for s in l]

If I run %timeit in the IPython console for these functions on my machine, I get

%timeit strp_isostr(datelist)
98.2 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit isostr_to_nparr(datelist)
1.49 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit split_isostr(datelist)
3.02 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit resplit_isostr(datelist)
3.8 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit full_stptime(datelist)
16.7 ms ± 780 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So we can conclude that the built-in datetime.fromisoformat (available since Python 3.7) is by far the fastest option for the 1000-element input. However, this assumes you want a list of datetime objects to work with. If you need an np.array of datetime64 anyway, converting straight to that seems like the best option.
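
If the goal is Unix timestamps rather than datetime objects, a datetime64 array can be converted in one vectorized step. A minimal sketch, assuming the strings carry microsecond precision and represent UTC (NumPy's datetime64 carries no time zone); the function name isostr_to_unix is only illustrative:

```python
import numpy as np

def isostr_to_unix(l):
    # parse all strings at once into microsecond-resolution datetime64
    arr = np.array(l, dtype='datetime64[us]')
    # the integer representation is microseconds since 1970-01-01;
    # dividing by 1e6 scales it to float seconds
    return arr.astype('int64') / 1e6

stamps = isostr_to_unix(['2019-05-23 09:37:56.362965',
                         '1970-01-01 00:00:01.500000'])
print(stamps)
```

This was not part of the timings above, so its speed relative to the other options would need its own %timeit run.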


third party option: ciso8601

If you're able to install additional packages, ciso8601 is worth a look:

import ciso8601
def ciso(l):
    return list(map(ciso8601.parse_datetime, l))

%timeit ciso(datelist)
138 µs ± 1.83 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

from datetime object to seconds since the epoch

Looking at the conversion from datetime object to POSIX timestamp, using the most obvious datetime.timestamp method seems to be the most efficient:

import time
def dt_ts(l):
    return list(map(datetime.timestamp, l))

def timetup(l):
    return list(map(time.mktime, map(datetime.timetuple, l)))

%timeit dt_ts(strp_isostr(datelist))
572 µs ± 4.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit timetup(strp_isostr(datelist))
1.44 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
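
One caveat when comparing results across machines: datetime.timestamp() interprets a naive datetime as local time, so the value depends on the system time zone. A short sketch, assuming the input strings are meant as UTC, that pins the result by attaching timezone.utc first:

```python
from datetime import datetime, timezone

naive = datetime.fromisoformat('2019-05-23 09:37:56.362965')
# naive.timestamp() would use the local time zone of the machine;
# attaching UTC makes the result independent of it
aware = naive.replace(tzinfo=timezone.utc)
utc_ts = aware.timestamp()
print(utc_ts)
```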
FObersteiner
  • Wow - Thanks for the comprehensive answer! – gies0r May 07 '20 at 08:29
  • @gies0r, glad if it helps! I just noticed I forgot to include the conversion to POSIX timestamp, that you had in mind in your question. Sorry for that ;-) Will have a look into this as well when I have the time. – FObersteiner May 07 '20 at 08:36
  • @gies0r: made an edit to include the conversion to POSIX timestamp. – FObersteiner May 07 '20 at 19:43
  • It is a very complete answer - More than one could expect. Thank you again! Hopefully other people will benefit as well. – gies0r May 08 '20 at 22:45

I am moving to numpy.datetime64 because it has lower latency than datetime.strptime.

import numpy as np

# This is the format np.datetime64 needs:
#np.datetime64('2002-06-28T01:00:00.000000000+0100')

dt_bytes = b'2019-05-23 09:37:56.362965'
#dt_bytes_for_np = dt_bytes.split(b' ')[0] + b'T' + dt_bytes.split(b' ')[1]
dt_bytes_for_np = dt_bytes.replace(b' ', b'T')
ts = np.datetime64(dt_bytes_for_np)

And getting the Unix timestamp (this adds a bit of latency, but is still way better than datetime.strptime):

ts.astype('datetime64[ns]').astype('float') / 1000000000
1558604276.362965
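
An alternative sketch for this last step, which avoids the hard-coded 1000000000 by dividing the offset from the epoch by a one-second timedelta64. It assumes the value is UTC, like the snippet above, and uses a str literal for the input:

```python
import numpy as np

ts = np.datetime64('2019-05-23T09:37:56.362965')
epoch = np.datetime64(0, 's')  # 1970-01-01T00:00:00
# timedelta64 / timedelta64 yields a plain float number of seconds
unix_ts = (ts - epoch) / np.timedelta64(1, 's')
print(unix_ts)
```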
gies0r