Here's the answer, with several possible solutions.
The reason:
The conversion is slow because pandas falls back to dateutil.parser.parse when the strings are in a non-default format or when no format string is supplied. That parser is much more flexible, but also much slower.
Fix for your issue with best performance
import pandas as pd

def format_datetime(dt_series):
    def get_split_date(strdt):
        # 'Wed Nov 22 08:31:24 +0000 2017' -> 'Nov 22 2017 08:31:24'
        # (the weekday and the UTC offset are dropped)
        split_date = strdt.split()
        return ' '.join([split_date[1], split_date[2], split_date[5], split_date[3]])

    # An explicit format lets pandas skip the slow dateutil fallback
    return pd.to_datetime(dt_series.apply(get_split_date), format='%b %d %Y %H:%M:%S')

df["created_at"] = format_datetime(df["timestamp"])
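Another option to try, shown only as a sketch: since the slowdown comes from not supplying a format, you can describe the whole original string (weekday and offset included) in the format itself and skip the string reshuffling. The `%a` and `%z` directives are standard strptime codes; the sample data and the `TWITTER_FORMAT` name below are made up for illustration, and I have not benchmarked this variant against the helper above, so time it on your own data and pandas version.

```python
import pandas as pd

# Illustrative sample data in the same layout as the question's timestamps
timestamps = pd.Series([
    "Wed Nov 22 08:31:24 +0000 2017",
    "Thu Nov 23 09:00:00 +0000 2017",
])

# Hypothetical constant name; the directives themselves are standard:
# %a weekday abbreviation, %b month abbreviation, %z UTC offset
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

# utc=True converts every value to UTC, so the result is timezone-aware
parsed = pd.to_datetime(timestamps, format=TWITTER_FORMAT, utc=True)

# Drop the timezone if you want naive datetimes like format_datetime returns
naive = parsed.dt.tz_localize(None)
```

Unlike the helper above, this keeps the offset information during parsing, which matters if some rows are not `+0000`.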
Benchmarks
timestamps = [
    'Wed Nov 22 08:31:24 +0000 2017',
    'Wed Nov 22 08:33:24 +0000 2018',
    'Wed Nov 22 08:31:24 +0000 2019',
]
df = pd.DataFrame(timestamps * 300000, columns=['timestamp'])
%timeit df["created_at"] = pd.to_datetime(df["timestamp"]).dt.strftime('%Y-%m-%d %H:%M:%S')
4min 8s ± 1min 10s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df["created_at1"] = format_datetime(df["timestamp"])
5.6 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)