
I have a decently large data set (nothing crazy), but when I run the code below to convert it to datetime, the cell runs for a very long time. Any idea why? Is there any way I can improve the code's performance?

Code:

df["created_at"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d %H:%M:%S')

where the timestamp column value originally looks like this:

Wed Nov 22 08:31:24 +0000 2017  

Thanks

Chris90
  • Which version of pandas are you using? With version 1.1.2 I cannot reproduce this, at least with artificial data. The conversion is pretty fast no matter how I convert to datetime (auto-detection of the format or setting a specific format). – FObersteiner Sep 23 '20 at 05:57
  • Hi, I have version '1.1.2' – Chris90 Sep 23 '20 at 06:07
  • OK, did you benchmark the individual steps, i.e. parsing to datetime and from there back to string? It seems to me that the conversion back to string is taking some time... using `.astype(str)` or `.apply(pd.Timestamp.isoformat)` might be faster than strftime. – FObersteiner Sep 23 '20 at 06:16
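For reference, a minimal sketch of the two steps the comment separates, using artificial data in the question's format (the sample string and column layout are assumptions based on the post):

```python
import pandas as pd

s = pd.Series(["Wed Nov 22 08:31:24 +0000 2017"] * 3)

# Step 1: parse the strings to timezone-aware datetimes with an explicit format
parsed = pd.to_datetime(s, format="%a %b %d %H:%M:%S %z %Y")

# Step 2: render back to strings; astype(str) keeps the UTC offset,
# while strftime drops it and lets you choose the layout
as_str = parsed.astype(str)
via_strftime = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")

print(as_str.iloc[0])        # 2017-11-22 08:31:24+00:00
print(via_strftime.iloc[0])  # 2017-11-22 08:31:24
```

Timing each step separately (e.g. with `%timeit`) shows which half dominates.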

2 Answers

3

Yeah, so pandas' to_datetime method is notoriously slow if you don't explicitly specify the format. Here are the docs for the method, and here is the standard for how the format string should be set up.

I don't know exactly how your time data is set up, but this should set you on the right path:

import pandas as pd


# Weekday as abbreviated name: "%a"
df = pd.DataFrame(["Wed"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a")
print(df)

# Month as abbreviated name: "%b"
df = pd.DataFrame(["Wed Nov"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a %b")
print(df)

# Day of the month, zero-padded: "%d"
df = pd.DataFrame(["Wed Nov 22"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a %b %d")
print(df)

# Time as hour:minute:second: "%H:%M:%S"
df = pd.DataFrame(["Wed Nov 22 08:31:24"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S")
print(df)

# UTC offset: "%z"
df = pd.DataFrame(["Wed Nov 22 08:31:24 +0000"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S %z")
print(df)

# Year with century: "%Y"
df = pd.DataFrame(["Wed Nov 22 08:31:24 +0000 2017"], columns=["timestamp"])
df["created_at"] = pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S %z %Y")
print(df)
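Putting it together with the string layout the question asks for (a sketch; the column names follow the original post):

```python
import pandas as pd

df = pd.DataFrame(["Wed Nov 22 08:31:24 +0000 2017"], columns=["timestamp"])

# Parse with the full explicit format, then render in the target layout
df["created_at"] = (
    pd.to_datetime(df["timestamp"], format="%a %b %d %H:%M:%S %z %Y")
      .dt.strftime("%Y-%m-%d %H:%M:%S")
)
print(df["created_at"].iloc[0])  # 2017-11-22 08:31:24
```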
Al-Baraa El-Hag
  • Alright ;-) Still, I cannot reproduce the OP's issue. With the same pandas version 1.1.2 on Python 3.8.5 x64, it makes almost no difference performance-wise whether I specify a format or not. – FObersteiner Sep 24 '20 at 06:27
2

Here's an answer with several possible solutions.

The reason:

This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
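To illustrate that fallback, here's a sketch with the question's sample string (dateutil ships as a pandas dependency):

```python
from datetime import datetime
from dateutil import parser

s = "Wed Nov 22 08:31:24 +0000 2017"

# Flexible but slow: dateutil guesses the layout token by token
flexible = parser.parse(s)

# Rigid but fast: strptime with an explicit pattern
fast = datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")

print(flexible == fast)  # True: same instant via two different code paths
```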

A fix for your issue with the best performance:

def format_datetime(dt_series):

    def get_split_date(strdt):
        # "Wed Nov 22 08:31:24 +0000 2017" -> "Nov 22 2017 08:31:24"
        # (note: the "+0000" UTC offset is discarded)
        split_date = strdt.split()
        str_date = split_date[1] + ' ' + split_date[2] + ' ' + split_date[5] + ' ' + split_date[3]
        return str_date

    dt_series = pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')

    return dt_series


df["created_at"] = format_datetime(df["timestamp"])

Benchmarks

timestamps = [
    'Wed Nov 22 08:31:24 +0000 2017', 'Wed Nov 22 08:33:24 +0000 2018', 'Wed Nov 22 08:31:24 +0000 2019'
]
df = pd.DataFrame(timestamps * 300000, columns=['timestamp'])
%timeit df["created_at"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d %H:%M:%S')

4min 8s ± 1min 10s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df["created_at1"] = format_datetime(df["timestamp"])

5.6 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pavel Slepiankou