5

I am trying to construct a datetime column in Pandas that represents multiple columns describing the year, month, day, etc. Most of the other answers I can find on this topic involve processing data in the opposite direction (from datetime to integer hour, for instance).

df = pd.DataFrame()

df['year'] = [2019, 2019, 2019, 2019, 2019, 2019]
df['month'] = [8, 8, 8, 8, 8, 8]
df['day'] = [1, 1, 1, 1, 1, 1]
df['hour'] = [10,10,11,11,12,12]
df['minute'] = [15,45,20,40,10,50]
df['second'] = [0, 1, 5, 10, 10, 11]

Goal:

df['datetime_val'] = 
0   2019-08-01 10:15:00
1   2019-08-01 10:45:01
2   2019-08-01 11:20:05
3   2019-08-01 11:40:10
4   2019-08-01 12:10:10
5   2019-08-01 12:50:11
Name: datetime_vals, dtype: datetime64[ns]

In the example above, how could I rapidly create a datetime column representing the constituent time information? I could easily do this with .apply() and a helper function but I envision performing this operation for millions of rows. I would love something inbuilt / vectorized. Thanks!

MattM
  • 317
  • 4
  • 12
  • Maybe take a peak at: https://stackoverflow.com/questions/11858472/string-concatenation-of-two-pandas-columns , especially the 3rd answer that compares different methods of column string comprehension – PeptideWitch Aug 01 '19 at 23:34

3 Answers3

8

IIUC to_datetime can take dataframe , only if the columns is well named as yours

pd.to_datetime(df)
0   2019-08-01 10:15:00
1   2019-08-01 10:45:01
2   2019-08-01 11:20:05
3   2019-08-01 11:40:10
4   2019-08-01 12:10:10
5   2019-08-01 12:50:11
dtype: datetime64[ns]
BENY
  • 317,841
  • 20
  • 164
  • 234
2

After reading through this comparison of string concatenation methods for pandas dataframes, it looks like you could benefit from using df.assign:

df.assign(datetime_val=[f"{str(year)}-{str(month)}-{str(day)} {str(hour)}:{str(minute)}:{str(second)}" for year, month, day, hour, minute, second in zip(df['year'], df['month'], df['day'], df['hour'], df['minute'], df['second'])])

EDIT2:

My method does not return datetime64 objects, however, as pointed out below by Andy L. In fact, method 3 becomes incredibly slow when swapping out the strings for datetime objects. However, the method 1 vs method 2 comparison is still valid.

EDIT:

Did some testing to compare the three methods presented here

enter image description here

PeptideWitch
  • 2,239
  • 14
  • 30
  • 2
    @AndyL. Actually, fair call - OP stated that they wanted datetime64 as their data type while my method only returns strings. My apologies. If you edit your answer (like, add a space somewhere), I'll un-downvote it. Nevertheless, it's still slower than WeNYoBen's method. – PeptideWitch Aug 02 '19 at 00:44
  • I edited it. I never dispute your speed test. WeNYoBen's solution is better. That's why I upvoted him. – Andy L. Aug 02 '19 at 00:54
  • 2
    My philosophy is any solution always worth effort of the poster, so I never downvote any solution even that solution is wrong. On wrong solution, I may comment, but never downvote. I am kind of off balance in this case. I apologizes for my harsh words. I deleted those comments. Please edit yours so I can retract my downvoted on yours – Andy L. Aug 02 '19 at 00:59
  • 2
    Oh, thank you. That's kind of you. I like your philosophy, actually...and maybe in my rush to get the answer right, I became a bit arrogant too by trampling over your answer. It's useful for me to hear when I'm wrong, even if the words are a little harsh, so no need to apologise in that regard :) – PeptideWitch Aug 02 '19 at 01:05
2

you may convert whole df to str and use agg to concat string and with format parameter of pd.to_datetime

df = df.astype(str)
pd.to_datetime(df.agg('-'.join, axis=1), format='%Y-%m-%d-%H-%M-%S')

Out[170]:
0   2019-08-01 10:15:00
1   2019-08-01 10:45:01
2   2019-08-01 11:20:05
3   2019-08-01 11:40:10
4   2019-08-01 12:10:10
5   2019-08-01 12:50:11
dtype: datetime64[ns]
Andy L.
  • 24,909
  • 4
  • 17
  • 29