Extracting the first day of month of a datetime type column in pandas

Question

I have the following dataframe:

user_id    purchase_date 
  1        2015-01-23 14:05:21
  2        2015-02-05 05:07:30
  3        2015-02-18 17:08:51
  4        2015-03-21 17:07:30
  5        2015-03-11 18:32:56
  6        2015-03-03 11:02:30

and purchase_date is a datetime64[ns] column. I need to add a new column df[month] that contains first day of the month of the purchase date:

df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01

I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m in SQL. I have tried the following code:

     df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))

It works somehow but returns: 2015-01-01 14:05:21.

jezrael · Accepted Answer · 2021-04-08T08:38:12.110

Simpliest and fastest is convert to numpy array by to_numpy and then cast:

df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Another solution with floor and pd.offsets.MonthBegin(1) and add pd.offsets.MonthEnd(0) for correct ouput if first day of month:

df['month'] = (df['purchase_date'].dt.floor('d') + 
                           pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
                         .dt.floor('d'))
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Last solution is create month period by to_period:

df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
   user_id       purchase_date   month
0        1 2015-01-23 14:05:21 2015-01
1        2 2015-02-05 05:07:30 2015-02
2        3 2015-02-18 17:08:51 2015-02
3        4 2015-03-21 17:07:30 2015-03
4        5 2015-03-11 18:32:56 2015-03
5        6 2015-03-03 11:02:30 2015-03

... and then to datetimes by to_timestamp, but it is a bit slowier:

df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

There are many solutions, so:

Timings (in pandas 1.2.3):

rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})  
print (df.head())



In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [73]: %timeit df['purchase_date'].dt.to_period('M')
9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The problem with `pd.offsets.MonthBegin(1)` is when purchase_date is the first day of the month, your code will return the first day of the previous month. — pomber, Apr 07 '18 at 15:28
`pd.offsets.MonthBegin(0)` works for the first day but fails for the rest — pomber, May 03 '18 at 14:07
I agree, the solution is flawed for when the date is the first day of the month. — bloo, Aug 27 '19 at 10:18
How is the answer even accepted - it's just flawed and it completely ignores the corner cases (first or the last day the month, depending on parameters of `MonthBegin`) @jezrael, please exted your example with `2015-01-01` and `2015-01-31` to see the problem — Andrew Slabko, Apr 07 '21 at 17:04
Great answer! Realised without converting to numpy also works for the first code snippet: `df['month'] = df['purchase_date'].astype('datetime64[M]')` (pandas v1.3.5) — Zolzaya Luvsandorj, Jan 13 '22 at 00:58

MaxU - stand with Ukraine · Answer 2 · 2017-07-25T14:18:15.143

12

We can use date offset in conjunction with Series.dt.normalize:

In [60]: df['month'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)

In [61]: df
Out[61]:
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Or much nicer solution from @BradSolomon

In [95]: df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)

In [96]: df
Out[96]:
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

edited Jul 25 '17 at 14:18

answered Jul 25 '17 at 13:24

MaxU - stand with Ukraine

205,989
36
386
419

3

Or pass `normalize` as param, `df.purchase_date - pd.offsets.MonthBegin(1, normalize=True))`. – Brad Solomon Jul 25 '17 at 14:14
@BradSolomon, wow! Thank you! I didn't know `pd.offsets.MonthBegin()` has such a parameter – MaxU - stand with Ukraine Jul 25 '17 at 14:16
4

@MaxU If `df['purchase_date']` is already on the first day of a month, minus `MonthBegin(1)` would return the first day of the previous month. – ricca Mar 02 '18 at 07:34
4

What should happen if the purchase_date is already the first day of the month? For `2017-08-01 03:45:56` I'm getting: `2017-07-01` – pomber Apr 07 '18 at 15:25

score 11 · Answer 3 · answered Feb 04 '20 at 13:13

How about this easy solution?
As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.

df['date'] = df['purchase_date'].apply(lambda x: x.strftime('%Y-%m-01'))

print(df)
 user_id   purchase_date       date
0   1   2015-01-23 14:05:21 2015-01-01
1   2   2015-02-05 05:07:30 2015-02-01
2   3   2015-02-18 17:08:51 2015-02-01
3   4   2015-03-21 17:07:30 2015-03-01
4   5   2015-03-11 18:32:56 2015-03-01
5   6   2015-03-03 11:02:30 2015-03-01

Because we used strftime, now the date column is in object (string) type:

print(df.dtypes)
user_id                   int64
purchase_date    datetime64[ns]
date                     object
dtype: object

Now if you want it to be in datetime64[ns], just use pd.to_datetime():

df['date'] = pd.to_datetime(df['date'])

print(df.dtypes)
user_id                   int64
purchase_date    datetime64[ns]
date             datetime64[ns]
dtype: object

kadee · Answer 4 · 2020-03-13T10:15:50.713

9

Most proposed solutions don't work for the first day of the month.

Following solution works for any day of the month:

df['month'] = df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)

[EDIT]

Another, more readable, solution is:

from pandas.tseries.offsets import MonthBegin
df['month'] = df['purchase_date'].dt.normalize().map(MonthBegin().rollback)

Be aware not to use:

df['month'] = df['purchase_date'].map(MonthBegin(normalize=True).rollback)

because that gives incorrect results for the first day due to a bug: https://github.com/pandas-dev/pandas/issues/32616

edited Mar 13 '20 at 10:15

answered Jul 11 '19 at 15:50

kadee

8,067
1
39
31

This also works for last day of month `pd.to_datetime('2021-01-31')+pd.tseries.offsets.MonthEnd(0)-pd.tseries.offsets.MonthBegin() == Timestamp('2021-01-01 00:00:00')` – citynorman Mar 29 '21 at 16:23

BENY · Answer 5 · 2017-07-25T13:51:52.133

6

Try this ..

df['month']=pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')

Out[187]: 
   user_id        purchase_date       month
0        1  2015-01-23 14:05:21  2015-01-01
1        2  2015-02-05 05:07:30  2015-02-01
2        3  2015-02-18 17:08:51  2015-02-01
3        4  2015-03-21 17:07:30  2015-03-01
4        5  2015-03-11 18:32:56  2015-03-01
5        6  2015-03-03 11:02:30  2015-03-01

edited Jul 25 '17 at 13:51

answered Jul 25 '17 at 13:48

BENY

317,841
20
164
234

Simplest that doesn't involve another library (like numpy) – Ryan Ward Nov 04 '22 at 23:19

score 3 · Answer 6 · answered Sep 22 '19 at 12:34

To extract the first day of every month, you could write a little helper function that will also work if the provided date is already the first of month. The function looks like this:

def first_of_month(date):
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)

You can apply this function on pd.Series:

df['month'] = df['purchase_date'].apply(first_of_month)

With that you will get the month column as a Timestamp. If you need a specific format, you might convert it with the strftime() method.

df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')

This worked great, even for the first day of the month. I vectorized it though instead of doing the `.apply()` like so: `df['date_flat'] = df['date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)` — cheevahagadog, Apr 16 '20 at 15:57

score 2 · Answer 7 · answered May 03 '18 at 14:15

2

For me df['purchase_date'] - pd.offsets.MonthBegin(1) didn't work (it fails for the first day of the month), so I'm subtracting the days of the month like this:

df['purchase_date'] - pd.to_timedelta(df['purchase_date'].dt.day - 1, unit='d')

answered May 03 '18 at 14:15

pomber

23,132
10
81
94

While this is the only solution that works for the first day of the month, it's much slower than using MonthBegin for some reason, were you able to speed it up? – Eyal Shulman Sep 26 '18 at 21:47

score 0 · Answer 8 · edited Mar 08 '19 at 05:50

@Eyal: This is what I did to get the first day of the month using pd.offsets.MonthBegin and handle the scenario where day is already first day of month.

import datetime

from_date= pd.to_datetime('2018-12-01')

from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date

from_date

result: Timestamp('2018-12-01 00:00:00')

from_date= pd.to_datetime('2018-12-05')

from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not rom_date.is_month_start else from_date

from_date

result: Timestamp('2018-12-01 00:00:00')

score 0 · Answer 9 · answered Dec 02 '22 at 12:59

Just adding my 2 cents, for the sake of completeness:

1 - transform purchase_date to date, instead of datetime. This will remove hour, minute, second, etc...

df['purchase_date'] = df['purchase_date'].dt.date

2 - apply the datetime replace, to use day 1 instead of the original:

df['purchase_date_begin'] = df['purchase_date'].apply(lambda x: x.replace(day=1))

This replace method is available on the datetime library:

from datetime import date

today = date.today()
month_start = today.replace(day=1)

and you can replace day, month, year, etc...

score -1 · Answer 10 · answered Aug 11 '21 at 13:31

-1

try this Pandas libraries, where 'purchase_date' is date parameter placed into the module.

date['month_start'] = pd.to_datetime(sched_slim.purchase_date)
.dt.to_period('M')
.dt.to_timestamp()

answered Aug 11 '21 at 13:31

miro_muras

33
6

Not bad solution but this code won't run as is. Need brackets around the code after = – Ryan Ward Nov 04 '22 at 23:22

Extracting the first day of month of a datetime type column in pandas

10 Answers10

Linked

Related