
I'm faced with a rather simple task: my SQL query returns three separate columns, one for hours, one for minutes, and one for seconds. I want them combined into a single time value.

My approach was to apply the dt.time function:

# Import relevant libraries
import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

# Create an example Dataframe
rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

# Create my time function
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

The result is ridiculously slow in my real-world scenario, clocking in at more than 6 minutes for roughly 4 million rows.

Ravi
MichaelA

2 Answers


`apply` seems to be rather slow; faster results are achieved with the built-in `map` function:

import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

%%timeit
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

%%timeit
test["time_map"] = list(map(lambda hours, minutes, seconds: dt.time(hours, minutes, seconds)
                            , test.hours
                            , test.minutes
                            , test.seconds)
                        )

The first version takes just short of 16 seconds, the second approach only half a second, which makes it more than 30 times faster.

The solution works great for me and can also be used for similar tasks.
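To confirm that both variants really produce identical values, here is a quick check on a smaller frame (seed and size are my own choices, not from the original post). Note that `dt.time` can be passed to `map` directly, without wrapping it in a lambda:

```python
import datetime as dt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"hours": rng.integers(0, 24, 1000),
                   "minutes": rng.integers(0, 60, 1000),
                   "seconds": rng.integers(0, 60, 1000)})

# Row-wise apply (the slow variant) ...
via_apply = df.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis=1)
# ... versus map over the three columns in lockstep.
via_map = pd.Series(map(dt.time, df.hours, df.minutes, df.seconds))

assert (via_apply == via_map).all()
```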

I created this self-answered Q&A because I didn't find a fast solution and it took me some time to arrive at a better version. It is a specific application of this question:

MichaelA
    I would **avoid** using `datetime` objects in Pandas since you can't add/subtract times in that format. Not to mention the operation you mentioned is **not** vectorized by any means. `test["time_map"] = pd.to_timedelta(test.hours*3600+test.minutes*60+test.seconds, unit='s')` took a mere 20 ms, which is also a 20x improvement over your suggested solution. – Quang Hoang Dec 21 '20 at 16:07

A very hackish approach would be to work with timedeltas, which are fast to manipulate. This works because your hours + minutes + seconds will always stay within the valid range (if you had, say, 48-hour values, this would fail):

import datetime as dt
import pandas as pd

# Convert each column to a timedelta and sum them up (vectorized).
q = (
    pd.to_timedelta(test.hours, unit="hours")
    + pd.to_timedelta(test.minutes, unit="minutes")
    + pd.to_timedelta(test.seconds, unit="seconds")
)
# Anchor at an arbitrary date, then extract just the time-of-day.
time_only = q.add(dt.datetime(2020, 1, 1)).apply(lambda x: x.time())
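The final `.apply(lambda x: x.time())` can itself be avoided: adding the deltas to a `Timestamp` yields a datetime64 Series, and its `.dt.time` accessor extracts the time objects in one pass. A minimal sketch, assuming the same `test` frame as above (recreated here with a smaller size and fixed seed):

```python
import datetime as dt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test = pd.DataFrame({"hours": rng.integers(0, 24, 1000),
                     "minutes": rng.integers(0, 60, 1000),
                     "seconds": rng.integers(0, 60, 1000)})

# Total seconds since midnight as one timedelta Series (single vectorized pass).
q = pd.to_timedelta(test.hours * 3600 + test.minutes * 60 + test.seconds, unit="s")
# Anchor at an arbitrary date; .dt.time pulls out just the time-of-day.
test["time"] = (pd.Timestamp("2020-01-01") + q).dt.time
```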

I would also recommend modifying your SQL query to cast the value to a time format there, instead of doing this on the Python side. That would, in my opinion, be the correct approach and should be much faster. Doing this conversion on the Python side feels like a workaround for a bad SQL query.
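As a rough sketch of the SQL-side idea, using an in-memory SQLite database as a stand-in for the real one (table and column names are my assumptions; SQLite has no native TIME type, so `printf` builds an `'HH:MM:SS'` string in the query, whereas on a server database you would `CAST`/`CONVERT` to a real TIME type instead):

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (hours INT, minutes INT, seconds INT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [(13, 5, 9), (0, 59, 59)])

# Let the database assemble the time value; pandas receives it ready-made.
df = pd.read_sql_query(
    "SELECT printf('%02d:%02d:%02d', hours, minutes, seconds) AS time_value FROM t",
    conn,
)
print(df["time_value"].tolist())  # ['13:05:09', '00:59:59']
```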

Steven G