
I'm faced with a rather simple task: my SQL query returns three separate columns, one for hours, one for minutes, and one for seconds. I want them combined into a single time value.

My approach was to apply the dt.time function:

# Import relevant libraries
import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

# Create an example Dataframe
rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

# Create my time function
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

The result is ridiculously slow in my real-world scenario, clocking in at more than 6 minutes for roughly 4 million rows.

Ravi
MichaelA

2 Answers


`apply` seems to be rather slow; faster results are achieved with the built-in `map` function:

import datetime as dt
from timeit import timeit
import pandas as pd
import numpy as np

rng = np.random.default_rng()
test = pd.DataFrame({"hours": rng.integers(0,24,1000000)
                    , "minutes" : rng.integers(0, 60, 1000000)
                    , "seconds":  rng.integers(0, 60, 1000000)
                    })

%%timeit
test["time"] =  test.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis = 1)

%%timeit
test["time_map"] = list(map(lambda hours, minutes, seconds: dt.time(hours, minutes, seconds)
                            , test.hours
                            , test.minutes
                            , test.seconds)
                        )

The first version takes just short of 16 seconds, the second approach only half a second, which makes it more than 30 times faster.

The solution works great for me and can also be used for similar tasks.
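To confirm that both variants really produce identical values, here is a quick check on a smaller frame (seed and size are my own choices, not from the original post). Note that `dt.time` can be passed to `map` directly, without wrapping it in a lambda:

```python
import datetime as dt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"hours": rng.integers(0, 24, 1000),
                   "minutes": rng.integers(0, 60, 1000),
                   "seconds": rng.integers(0, 60, 1000)})

# Row-wise apply (the slow variant) ...
via_apply = df.apply(lambda x: dt.time(x.hours, x.minutes, x.seconds), axis=1)
# ... versus map over the three columns in lockstep.
via_map = pd.Series(map(dt.time, df.hours, df.minutes, df.seconds))

assert (via_apply == via_map).all()
```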

I created this self-answered Q&A because I didn't find a fast solution and it took me some time to arrive at a better version. It is a specific application of this question:

MichaelA
    I would **avoid** using `datetime` objects in Pandas since you can't add/subtract times in that format. Not to mention the operation you mentioned is **not** vectorized by any means. `test["time_map"] = pd.to_timedelta(test.hours*3600+test.minutes*60+test.seconds, unit='s')` took a mere 20 ms, which is also a 20x improvement over your suggested solution. – Quang Hoang Dec 21 '20 at 16:07

A very hackish approach would be to work with timedeltas, which are fast to manipulate. This works because your hours + minutes + seconds will always stay within the valid range (if you had, say, 48-hour values, this would fail):

import datetime as dt
import pandas as pd

# Convert each column to a timedelta and sum them up (vectorized).
q = (
    pd.to_timedelta(test.hours, unit="hours")
    + pd.to_timedelta(test.minutes, unit="minutes")
    + pd.to_timedelta(test.seconds, unit="seconds")
)
# Anchor at an arbitrary date, then extract just the time-of-day.
time_only = q.add(dt.datetime(2020, 1, 1)).apply(lambda x: x.time())
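The final `.apply(lambda x: x.time())` can itself be avoided: adding the deltas to a `Timestamp` yields a datetime64 Series, and its `.dt.time` accessor extracts the time objects in one pass. A minimal sketch, assuming the same `test` frame as above (recreated here with a smaller size and fixed seed):

```python
import datetime as dt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
test = pd.DataFrame({"hours": rng.integers(0, 24, 1000),
                     "minutes": rng.integers(0, 60, 1000),
                     "seconds": rng.integers(0, 60, 1000)})

# Total seconds since midnight as one timedelta Series (single vectorized pass).
q = pd.to_timedelta(test.hours * 3600 + test.minutes * 60 + test.seconds, unit="s")
# Anchor at an arbitrary date; .dt.time pulls out just the time-of-day.
test["time"] = (pd.Timestamp("2020-01-01") + q).dt.time
```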

I would also recommend modifying your SQL query to cast the value to a time format there, instead of doing this on the Python side. That would, in my opinion, be the correct approach and should be much faster. Doing this conversion on the Python side feels like a workaround for a bad SQL query.
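As a rough sketch of the SQL-side idea, using an in-memory SQLite database as a stand-in for the real one (table and column names are my assumptions; SQLite has no native TIME type, so `printf` builds an `'HH:MM:SS'` string in the query, whereas on a server database you would `CAST`/`CONVERT` to a real TIME type instead):

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (hours INT, minutes INT, seconds INT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [(13, 5, 9), (0, 59, 59)])

# Let the database assemble the time value; pandas receives it ready-made.
df = pd.read_sql_query(
    "SELECT printf('%02d:%02d:%02d', hours, minutes, seconds) AS time_value FROM t",
    conn,
)
print(df["time_value"].tolist())  # ['13:05:09', '00:59:59']
```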

Steven G