How to convert columns into one datetime column in pandas?

Question

I have a dataframe where the first 3 columns are 'MONTH', 'DAY', 'YEAR'

In each column there is an integer. Is there a Pythonic way to convert all three columns into datetimes while there are in the dataframe?

From:

M    D    Y    Apples   Oranges
5    6  1990      12        3
5    7  1990      14        4
5    8  1990      15       34
5    9  1990      23       21

into:

Datetimes    Apples   Oranges
1990-6-5        12        3
1990-7-5        14        4
1990-8-5        15       34
1990-9-5        23       21

score 141 · Answer 1 · edited Aug 12 '22 at 18:43

In version 0.18.1 you can use to_datetime, but:

The names of the columns have to be year, month, day, hour, minute and second:
Minimal columns are year, month and day

Sample:

import pandas as pd

df = pd.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                    'day': [4, 5],
                    'hour': [2, 3],
                    'minute': [10, 30],
                    'second': [21,25]})
    
print df
   day  hour  minute  month  second  year
0    4     2      10      2      21  2015
1    5     3      30      3      25  2016

print pd.to_datetime(df[['year', 'month', 'day']])
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

print pd.to_datetime(df[['year', 'month', 'day', 'hour']])
0   2015-02-04 02:00:00
1   2016-03-05 03:00:00
dtype: datetime64[ns]

print pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
0   2015-02-04 02:10:00
1   2016-03-05 03:30:00
dtype: datetime64[ns]

print pd.to_datetime(df)
0   2015-02-04 02:10:21
1   2016-03-05 03:30:25
dtype: datetime64[ns]

Another solution is convert to dictionary:

print df
   M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21

print pd.to_datetime(dict(year=df.Y, month=df.M, day=df.D))
0   1990-05-06
1   1990-05-07
2   1990-05-08
3   1990-05-09
dtype: datetime64[ns]

even (slightly) faster than the suggested one, because we're even dropping the minimal arithmetics, thanks jezrael! — CptSnuggles, Oct 05 '20 at 15:27
Leaving aside the performance discussion, I find the conversion to `dict` to be the easiest on the eyes. — Vlad Iliescu, Nov 20 '20 at 14:56

Jeff · Accepted Answer · 2013-10-13T23:09:54.383

71

In 0.13 (coming very soon), this is heavily optimized and quite fast (but still pretty fast in 0.12); both orders of magnitude faster than looping

In [3]: df
Out[3]: 
   M  D     Y  Apples  Oranges
0  5  6  1990      12        3
1  5  7  1990      14        4
2  5  8  1990      15       34
3  5  9  1990      23       21

In [4]: df.dtypes
Out[4]: 
M          int64
D          int64
Y          int64
Apples     int64
Oranges    int64
dtype: object

# in 0.12, use this
In [5]: pd.to_datetime((df.Y*10000+df.M*100+df.D).apply(str),format='%Y%m%d')

# in 0.13 the above or this will work
In [5]: pd.to_datetime(df.Y*10000+df.M*100+df.D,format='%Y%m%d')
Out[5]: 
0   1990-05-06 00:00:00
1   1990-05-07 00:00:00
2   1990-05-08 00:00:00
3   1990-05-09 00:00:00
dtype: datetime64[ns]

edited Oct 13 '13 at 23:09

answered Oct 13 '13 at 22:23

Jeff

125,376
21
220
187

Thank you, it worked, can you explain what the purposed of the *10000 and the *100 are? Nevermind, the purpose is to convert 2011, 5, 3, into 20110503 which can easily be read. Thanks!! – user1367204 Oct 14 '13 at 00:00
yep - this is in fact the fastest method because these are vectorized numeric operations and the conversion to datetime doesn't round trip thru strings – Jeff Oct 14 '13 at 00:05
8

I'd be interested to see whether you get better results from `pd.to_datetime(dict(year=df.Y, month=df.M, day=df.D))` – holdenweb Jun 10 '15 at 07:02
1

@holdenweb: according to `%timeit`, they're approximately the same - the numerical version is maybe 5-10% faster on the dataset I used. I will use the dict version, because it's more readable. – naught101 Jan 23 '19 at 03:07
See https://stackoverflow.com/questions/75115465/how-to-convert-year-month-day-hour-minute-columns-into-a-single-datetime-colu/75118182#75118182 – Laurent B. Jan 14 '23 at 13:16

unutbu · Answer 3 · 2014-09-02T03:35:03.480

Here is a alternative which uses NumPy datetime64 and timedelta64 arithmetic. It appears to be a bit faster for small DataFrames and much faster for larger DataFrames:

import numpy as np
import pandas as pd

df = pd.DataFrame({'M':[1,2,3,4], 'D':[6,7,8,9], 'Y':[1990,1991,1992,1993]})
#    D  M     Y
# 0  6  1  1990
# 1  7  2  1991
# 2  8  3  1992
# 3  9  4  1993

y = np.array(df['Y']-1970, dtype='<M8[Y]')
m = np.array(df['M']-1, dtype='<m8[M]')
d = np.array(df['D']-1, dtype='<m8[D]')
dates2 = pd.Series(y+m+d)
# 0   1990-01-06
# 1   1991-02-07
# 2   1992-03-08
# 3   1993-04-09
# dtype: datetime64[ns]

In [214]: df = pd.concat([df]*1000)

In [215]: %timeit pd.to_datetime((df['Y']*10000+df['M']*100+df['D']).astype('int'), format='%Y%m%d')
100 loops, best of 3: 4.87 ms per loop

In [216]: %timeit pd.Series(np.array(df['Y']-1970, dtype='<M8[Y]')+np.array(df['M']-1, dtype='<m8[M]')+np.array(df['D']-1, dtype='<m8[D]'))
1000 loops, best of 3: 839 µs per loop

Here's a helper function to make this easier to use:

def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

In [437]: combine64(df['Y'], df['M'], df['D'])
Out[437]: array(['1990-01-06', '1991-02-07', '1992-03-08', '1993-04-09'], dtype='datetime64[D]')

I think this, or this functionality at least, would be a nice enhancement for pandas. We should just figure out an API. — joris, Sep 01 '14 at 19:58
Yeah, having to do things like `*10000` or `-1970` is silly. We should definitely be able to combine standard time types in much simpler way. (And if there *is* a better way but none of us know it, then there's at least a doc bug..) — DSM, Sep 01 '14 at 19:59

user1367204 · Answer 4 · 2013-10-13T23:45:30.043

6

I re-approached the problem and I think I found a solution. I initialized the csv file in the following way:

pandas_object = DataFrame(read_csv('/Path/to/csv/file', parse_dates=True, index_col = [2,0,1] ))

Where the:

index_col = [2,0,1]

represents the columns of the [year, month, day]

Only problem now is that now I have three new index columns, one represent the year, another the month, and another the day.

edited Oct 13 '13 at 23:45

answered Oct 13 '13 at 23:30

user1367204

4,549
10
49
78

4

try `parse_dates=[[2,0,1]]` (note the double brackets. Have a look at the doctstring on `read_csv` for an example. – TomAugspurger Oct 14 '13 at 00:17

score 2 · Answer 5 · answered Jun 10 '15 at 04:30

2

Convert the dataframe to strings for easy string concatenation:

df=df.astype(str)

then convert to datetime, specify the format:

df.index=pd.to_datetime(df.Y+df.M+df.D,format="%Y%m%d")

which replaces the index rather than creating a new column.

answered Jun 10 '15 at 04:30

Q-man

2,069
1
17
16

score 2 · Answer 6 · answered Oct 08 '19 at 16:52

2

Even better way to do is as below:

import pandas as pd

import datetime

dataset = pd.read_csv('dataset.csv')

date=dataset.apply(lambda x: datetime.date(int(x['Yr']), x['Mo'], x['Dy']),axis=1)

date = pd.to_datetime(date)

dataset = dataset.drop(columns=['Yr', 'Mo', 'Dy'])

dataset.insert(0, 'Date', date)

dataset.head()

answered Oct 08 '19 at 16:52

dolly singh

31
1

1

Please, add some description or comments on what you doing in this code. – Sergei Sirik Oct 08 '19 at 17:28

score 1 · Answer 7 · edited Jan 23 '19 at 15:56

1

 [pd.to_datetime(str(a)+str(b)+str(c),
                 format='%m%d%Y'
                ) for a,b,c in zip(df.M, df.D, df.Y)]

edited Jan 23 '19 at 15:56

holdenweb

33,305
7
57
77

answered Nov 01 '16 at 13:51

A.Kot

7,615
2
22
24

score 0 · Answer 8 · answered Jun 10 '15 at 04:44

Let's assume you've got a dictionary foo with each column of dates in parallel. If so, here's your one liner:

>>> from datetime import datetime
>>> foo = {"M": [1,2,3], "D":[30,30,21], "Y":[1980,1981,1982]}
>>>
>>> df = pd.DataFrame({"Datetime": [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]})

The real guts of it are this bit:

>>> [datetime(y,m,d) for y,m,d in zip(foo["Y"],foo["M"],foo["D"])]
[datetime.datetime(1980, 1, 30, 0, 0), datetime.datetime(1981, 2, 28, 0, 0), datetime.datetime(1982, 3, 21, 0, 0)]

This is the sort of thing zip was made for. It takes parallel lists and turns them into tuples. Then they get tuple unpacked (the for y,m,d in bit) by the list comprehension there, then fed into the datetime object constructor.

pandas seems happy with the datetime objects.

How to convert columns into one datetime column in pandas?

8 Answers8

Linked

Related