
I have three columns (h1, h2, h3) representing day, month, and year respectively, e.g.

import pandas as pd

df = pd.DataFrame({
    'h1': [1,2,3],
    'h2': [1,2,3],
    'h3': [2000,2001,2002]
})

When I perform:

pd.to_datetime(df[['h1', 'h2', 'h3']])

this results in the error ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing. But when I rename the columns and then perform the pd.to_datetime, e.g.

df = df.rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'})
df["date_col"] = pd.to_datetime(df[['day', 'month', 'year']])

it works and I get the date column. Do we have to do it this way, or is it possible to provide a format so that the columns can be detected as day, month, and year respectively? Thanks.


1 Answer


Summing up:

Your approach of renaming the columns is already the smart one, as the docs say:

Examples

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
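To illustrate the quoted docs (my own example, not part of the original answer): the plural forms of the keys are accepted as well, so columns named days/months/years assemble just like the singular ones.

```python
import pandas as pd

# Plural column names are accepted too, per the docs quoted above
df = pd.DataFrame({
    'days': [1, 2, 3],
    'months': [1, 2, 3],
    'years': [2000, 2001, 2002]
})

dates = pd.to_datetime(df)
print(dates)  # 2000-01-01, 2001-02-02, 2002-03-03 as datetime64[ns]
```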

But there are some alternatives. In my experience the list comprehension using zip is quite fast for small sets. From around 3,000 rows of data, renaming the columns becomes the quickest. Looking at the graph, the renaming approach pays a penalty on small sets but makes up for it on large ones.

Alternatives

pd.to_datetime(['-'.join(map(str, i)) for i in zip(df['h3'], df['h2'], df['h1'])])
pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])
df[['h3', 'h2', 'h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), axis=1)
pd.to_datetime(df[['h1', 'h2', 'h3']].rename(columns={'h1': 'day', 'h2': 'month', 'h3': 'year'}))
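A quick sanity check (my addition) that the string-joining and renaming alternatives produce the same dates on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'h1': [1, 2, 3],
    'h2': [1, 2, 3],
    'h3': [2000, 2001, 2002]
})

# Join "year-month-day" strings and parse them
joined = pd.to_datetime(
    ['-'.join(map(str, i)) for i in zip(df['h3'], df['h2'], df['h1'])])

# Rename on the fly, leaving df itself untouched
renamed = pd.to_datetime(df[['h1', 'h2', 'h3']].rename(
    columns={'h1': 'day', 'h2': 'month', 'h3': 'year'}))

# joined is a DatetimeIndex, renamed a Series, but the dates match
assert list(joined) == list(renamed)
```

Note that the renaming variant returns a Series aligned to df's index, which is convenient for assigning straight back as a new column.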

Timings Win10:

#df = pd.concat([df]*1000)
2.74 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
8.08 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
158 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.64 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Timings MacBook Air:

100 loops, best of 3: 6.1 ms per loop
100 loops, best of 3: 12.7 ms per loop
1 loop, best of 3: 335 ms per loop
100 loops, best of 3: 4.7 ms per loop

Update with the benchmarking code I wrote (happy to hear suggestions for improvement, or about any library that could help):

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
from collections import defaultdict

df = pd.DataFrame({
    'h1': np.arange(1,11),
    'h2': np.arange(1,11),
    'h3': np.arange(2000,2010)
})

myfuncs = {
"pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])])":
    lambda: pd.to_datetime(['-'.join(map(str,i)) for i in zip(df['h3'],df['h2'],df['h1'])]),
"pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)])":
    lambda: pd.to_datetime(['-'.join(i) for i in df[['h3','h2', 'h1']].values.astype(str)]),
"df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1)":
    lambda: df[['h3','h2','h1']].astype(str).apply(lambda x: pd.to_datetime('-'.join(x)), 1),
"pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))":
    lambda: pd.to_datetime(df[['h1','h2','h3']].rename(columns={'h1':'day','h2':'month','h3':'year'}))
}

d = defaultdict(dict)
step = 10
cont = True
while cont:
    lendf = len(df); print(lendf)
    for k, v in myfuncs.items():
        iters = 1
        t = 0
        while t < 0.2:
            ts = timeit.repeat(v, number=iters, repeat=3)
            t = min(ts)
            iters *= 10
        # iters was multiplied once more after the final timing, so undo that
        d[k][lendf] = t / (iters // 10)
        if t > 2: cont = False
    df = pd.concat([df]*step)

pd.DataFrame(d).plot().legend(loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.yscale('log'); plt.xscale('log'); plt.ylabel('seconds'); plt.xlabel('df rows')
plt.show()

Returns:

[Plot: timings in seconds vs number of df rows, log-log scale]

    Even faster: `pd.to_datetime(['-'.join(i) for i in df[['h3', 'h2', 'h1']].values.astype(str)])`; inspired by [this answer](https://stackoverflow.com/a/49804868/9209546). – jpp Jun 02 '18 at 11:59
  • @jpp Nice link, however I don't get those results; it is about 3x slower with my timings – Anton vBR Jun 02 '18 at 14:46
  • Yes, that is indeed strange; maybe a third person could enlighten us! :) I posted the code I am running – Anton vBR Jun 02 '18 at 14:49
  • For reference, I see `53.1ms / 20.5ms / 500ms / 7.85ms` with your exact code; renaming still the best. But the first vs second have very different relative performance; dual-core 2.5GHz, Win7, 4GB memory, pandas 0.20.3, Python 3.6.2 – jpp Jun 02 '18 at 15:08
  • @jpp Interesting. I am running i7 6700 on Win10, 16GB ddr4. Also tried it on my macbook (OSX) – Anton vBR Jun 02 '18 at 15:11
    @jpp what do you think of the function I added? – Anton vBR Jun 02 '18 at 16:42
    @jezrael Actually it is this code I am trying to improve. – Anton vBR Jun 03 '18 at 07:15
  • @AntonvBR - thanks, what about use [`perplots`](https://stackoverflow.com/a/50389889/2901002)? – jezrael Jun 03 '18 at 08:05
    @jezrael Interesting, it is similar. – Anton vBR Jun 03 '18 at 08:10
  • I don't understand the performance angle? Was that in an old version of the question? – Niels Bom May 26 '21 at 09:47