
I have three columns (h1, h2, h3) representing day, month, and year respectively, e.g.

import pandas as pd

df = pd.DataFrame({
    'h1': [1,2,3],
    'h2': [1,2,3],
    'h3': [2000,2001,2002]
})

When I perform:

pd.to_datetime(df[['h1', 'h2', 'h3']])

this results in the error ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing. But when I rename the columns and then perform the pd.to_datetime, e.g.

df = df.rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'})
df["date_col"] = pd.to_datetime(df[['day', 'month', 'year']])

it works and I get the date column. Do we have to do it this way, or is it possible to provide a format so that the columns can be detected as day, month, and year respectively? Thanks.


1 Answer


Summing up:

Your approach of renaming the columns is already the smart one, as the docs say:

Examples

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
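To illustrate the quoted docs (my own example, not part of the original answer): the plural forms of the keys are accepted as well, so columns named days/months/years assemble just like the singular ones.

```python
import pandas as pd

# Plural column names are accepted too, per the docs quoted above
df = pd.DataFrame({
    'days': [1, 2, 3],
    'months': [1, 2, 3],
    'years': [2000, 2001, 2002]
})

dates = pd.to_datetime(df)
print(dates)  # 2000-01-01, 2001-02-02, 2002-03-03 as datetime64[ns]
```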

But there are some alternatives. In my experience the list comprehension using zip is quite fast for small sets. From around 3,000 rows of data, renaming the columns becomes the quickest. Looking at the graph, the renaming approach pays a penalty on small sets but makes up for it on large ones.

Alternatives

pd.to_datetime(['-'.join(map(str, i)) for i in zip(df['h3'], df['h2'], df['h1'])])
pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])
df[['h3', 'h2', 'h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), axis=1)
pd.to_datetime(df[['h1', 'h2', 'h3']].rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'}))
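A quick sanity check (my addition) that the string-joining and renaming alternatives produce the same dates on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],
    'h2': [1, 2, 3],
    'h3': [2000, 2001, 2002]
})

# Join "year-month-day" strings and parse them
joined = pd.to_datetime(
    ['-'.join(map(str, i)) for i in zip(df['h3'], df['h2'], df['h1'])])

# Rename on the fly, leaving df itself untouched
renamed = pd.to_datetime(df[['h1', 'h2', 'h3']].rename(
    columns={'h1': 'day', 'h2': 'month', 'h3': 'year'}))

# joined is a DatetimeIndex, renamed a Series, but the dates match
assert list(joined) == list(renamed)
```

Note that the renaming variant returns a Series aligned to df's index, which is convenient for assigning straight back as a new column.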

Timings Win10:

#df = pd.concat([df]*1000)
2.74 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.08 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
158 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.64 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timings MacBook Air:

100 loops, best of 3: 6.1 ms per loop
100 loops, best of 3: 12.7 ms per loop
1 loop, best of 3: 335 ms per loop
100 loops, best of 3: 4.7 ms per loop

Update with the benchmarking code I wrote (happy to hear suggestions for improvement, or about any library that could help):

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({
    'h1': np.arange(1,11),
    'h2': np.arange(1,11),
    'h3': np.arange(2000,2010)
})

myfuncs = {
"pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])":
    lambda: pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])]),
"pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)])":
    lambda: pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)]),
"df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1)":
    lambda: df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1),
"pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))":
    lambda: pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        # iters was multiplied once more after the final timing, so undo that
        d[k][lendf] = t / (iters // 10)
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Returns:

[Plot: timings in seconds vs number of df rows, log-log scale]

    Even faster: `pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])`; inspired by [this answer](https://stackoverflow.com/a/49804868/9209546). – jpp Jun 02 '18 at 11:59
  • @jpp Nice link, however I don't get those results; it is about 3x slower with my timings – Anton vBR Jun 02 '18 at 14:46
  • Yes, that is indeed strange; maybe a third person could enlighten us! :) I posted the code I am running – Anton vBR Jun 02 '18 at 14:49
  • For reference, I see `53.1ms / 20.5ms / 500ms / 7.85ms` with your exact code; renaming still the best. But the first vs second have very different relative performance; dual-core 2.5GHz, Win7, 4GB memory, pandas 0.20.3, Python 3.6.2 – jpp Jun 02 '18 at 15:08
  • @jpp Interesting. I am running i7 6700 on Win10, 16GB ddr4. Also tried it on my macbook (OSX) – Anton vBR Jun 02 '18 at 15:11
    @jpp what do you think of the function I added? – Anton vBR Jun 02 '18 at 16:42
    @jezrael Actually it is this code I am trying to improve. – Anton vBR Jun 03 '18 at 07:15
  • @AntonvBR - thanks, what about use [`perplots`](https://stackoverflow.com/a/50389889/2901002)? – jezrael Jun 03 '18 at 08:05
    @jezrael Interesting, it is similar. – Anton vBR Jun 03 '18 at 08:10
  • I don't understand the performance angle? Was that in an old version of the question? – Niels Bom May 26 '21 at 09:47