63

I have the following dataframe:

user_id    purchase_date 
  1        2015-01-23 14:05:21
  2        2015-02-05 05:07:30
  3        2015-02-18 17:08:51
  4        2015-03-21 17:07:30
  5        2015-03-11 18:32:56
  6        2015-03-03 11:02:30

and purchase_date is a datetime64[ns] column. I need to add a new column df[month] that contains first day of the month of the purchase date:

df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01

I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m in SQL. I have tried the following code:

     df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))

It works somehow but returns: 2015-01-01 14:05:21.

Aurora0001
  • 13,139
  • 5
  • 50
  • 53
chessosapiens
  • 3,159
  • 10
  • 36
  • 58

10 Answers10

107

Simpliest and fastest is convert to numpy array by to_numpy and then cast:

df['month'] = df['purchase_date'].to_numpy().astype('datetime64[M]')
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Another solution with floor and pd.offsets.MonthBegin(1) and add pd.offsets.MonthEnd(0) for correct ouput if first day of month:

df['month'] = (df['purchase_date'].dt.floor('d') + 
                           pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

df['month'] = ((df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1))
                         .dt.floor('d'))
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Last solution is create month period by to_period:

df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
   user_id       purchase_date   month
0        1 2015-01-23 14:05:21 2015-01
1        2 2015-02-05 05:07:30 2015-02
2        3 2015-02-18 17:08:51 2015-02
3        4 2015-03-21 17:07:30 2015-03
4        5 2015-03-11 18:32:56 2015-03
5        6 2015-03-03 11:02:30 2015-03

... and then to datetimes by to_timestamp, but it is a bit slowier:

df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

There are many solutions, so:

Timings (in pandas 1.2.3):

rng = pd.date_range('1980-04-01 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})  
print (df.head())



In [70]: %timeit df['purchase_date'].to_numpy().astype('datetime64[M]')
8.6 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [71]: %timeit df['purchase_date'].dt.floor('d') + pd.offsets.MonthEnd(n=0) - pd.offsets.MonthBegin(n=1)
23 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [72]: %timeit (df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)).dt.floor('d')
23.6 ms ± 97.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [73]: %timeit df['purchase_date'].dt.to_period('M')
9.25 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [74]: %timeit df['purchase_date'].dt.to_period('M').dt.to_timestamp()
17.6 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [76]: %timeit df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
23.1 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [77]: %timeit df['purchase_date'].dt.normalize().map(MonthBegin().rollback)
1.66 s ± 7.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 5
    The problem with `pd.offsets.MonthBegin(1)` is when purchase_date is the first day of the month, your code will return the first day of the previous month. – pomber Apr 07 '18 at 15:28
  • @pomber - Try use `pd.offsets.MonthBegin(0)` – jezrael Apr 07 '18 at 15:29
  • 5
    `pd.offsets.MonthBegin(0)` works for the first day but fails for the rest – pomber May 03 '18 at 14:07
  • 2
    I agree, the solution is flawed for when the date is the first day of the month. – bloo Aug 27 '19 at 10:18
  • How is the answer even accepted - it's just flawed and it completely ignores the corner cases (first or the last day the month, depending on parameters of `MonthBegin`) @jezrael, please exted your example with `2015-01-01` and `2015-01-31` to see the problem – Andrew Slabko Apr 07 '21 at 17:04
  • 1
    Great answer! Realised without converting to numpy also works for the first code snippet: `df['month'] = df['purchase_date'].astype('datetime64[M]')` (pandas v1.3.5) – Zolzaya Luvsandorj Jan 13 '22 at 00:58
  • @ZolzayaLuvsandorj this also works for pandas v1.0.3 – Tim Mar 16 '22 at 19:14
12

We can use date offset in conjunction with Series.dt.normalize:

In [60]: df['month'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)

In [61]: df
Out[61]:
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01

Or much nicer solution from @BradSolomon

In [95]: df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)

In [96]: df
Out[96]:
   user_id       purchase_date      month
0        1 2015-01-23 14:05:21 2015-01-01
1        2 2015-02-05 05:07:30 2015-02-01
2        3 2015-02-18 17:08:51 2015-02-01
3        4 2015-03-21 17:07:30 2015-03-01
4        5 2015-03-11 18:32:56 2015-03-01
5        6 2015-03-03 11:02:30 2015-03-01
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
11

How about this easy solution?
As purchase_date is already in datetime64[ns] format, you can use strftime to format the date to always have the first day of month.

df['date'] = df['purchase_date'].apply(lambda x: x.strftime('%Y-%m-01'))

print(df)
 user_id   purchase_date       date
0   1   2015-01-23 14:05:21 2015-01-01
1   2   2015-02-05 05:07:30 2015-02-01
2   3   2015-02-18 17:08:51 2015-02-01
3   4   2015-03-21 17:07:30 2015-03-01
4   5   2015-03-11 18:32:56 2015-03-01
5   6   2015-03-03 11:02:30 2015-03-01

Because we used strftime, now the date column is in object (string) type:

print(df.dtypes)
user_id                   int64
purchase_date    datetime64[ns]
date                     object
dtype: object

Now if you want it to be in datetime64[ns], just use pd.to_datetime():

df['date'] = pd.to_datetime(df['date'])

print(df.dtypes)
user_id                   int64
purchase_date    datetime64[ns]
date             datetime64[ns]
dtype: object
igorkf
  • 3,159
  • 2
  • 22
  • 31
9

Most proposed solutions don't work for the first day of the month.

Following solution works for any day of the month:

df['month'] = df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)

[EDIT]

Another, more readable, solution is:

from pandas.tseries.offsets import MonthBegin
df['month'] = df['purchase_date'].dt.normalize().map(MonthBegin().rollback)

Be aware not to use:

df['month'] = df['purchase_date'].map(MonthBegin(normalize=True).rollback)

because that gives incorrect results for the first day due to a bug: https://github.com/pandas-dev/pandas/issues/32616

kadee
  • 8,067
  • 1
  • 39
  • 31
  • This also works for last day of month `pd.to_datetime('2021-01-31')+pd.tseries.offsets.MonthEnd(0)-pd.tseries.offsets.MonthBegin() == Timestamp('2021-01-01 00:00:00')` – citynorman Mar 29 '21 at 16:23
6

Try this ..

df['month']=pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')

Out[187]: 
   user_id        purchase_date       month
0        1  2015-01-23 14:05:21  2015-01-01
1        2  2015-02-05 05:07:30  2015-02-01
2        3  2015-02-18 17:08:51  2015-02-01
3        4  2015-03-21 17:07:30  2015-03-01
4        5  2015-03-11 18:32:56  2015-03-01
5        6  2015-03-03 11:02:30  2015-03-01
BENY
  • 317,841
  • 20
  • 164
  • 234
3

To extract the first day of every month, you could write a little helper function that will also work if the provided date is already the first of month. The function looks like this:

def first_of_month(date):
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)

You can apply this function on pd.Series:

df['month'] = df['purchase_date'].apply(first_of_month)

With that you will get the month column as a Timestamp. If you need a specific format, you might convert it with the strftime() method.

df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')
mfeyx
  • 57
  • 4
  • This worked great, even for the first day of the month. I vectorized it though instead of doing the `.apply()` like so: `df['date_flat'] = df['date'] + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)` – cheevahagadog Apr 16 '20 at 15:57
2

For me df['purchase_date'] - pd.offsets.MonthBegin(1) didn't work (it fails for the first day of the month), so I'm subtracting the days of the month like this:

df['purchase_date'] - pd.to_timedelta(df['purchase_date'].dt.day - 1, unit='d')
pomber
  • 23,132
  • 10
  • 81
  • 94
  • While this is the only solution that works for the first day of the month, it's much slower than using MonthBegin for some reason, were you able to speed it up? – Eyal Shulman Sep 26 '18 at 21:47
0

@Eyal: This is what I did to get the first day of the month using pd.offsets.MonthBegin and handle the scenario where day is already first day of month.

import datetime

from_date= pd.to_datetime('2018-12-01')

from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date

from_date

result: Timestamp('2018-12-01 00:00:00')

from_date= pd.to_datetime('2018-12-05')

from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not rom_date.is_month_start else from_date

from_date

result: Timestamp('2018-12-01 00:00:00')

Til
  • 5,150
  • 13
  • 26
  • 34
0

Just adding my 2 cents, for the sake of completeness:

1 - transform purchase_date to date, instead of datetime. This will remove hour, minute, second, etc...

df['purchase_date'] = df['purchase_date'].dt.date

2 - apply the datetime replace, to use day 1 instead of the original:

df['purchase_date_begin'] = df['purchase_date'].apply(lambda x: x.replace(day=1))

This replace method is available on the datetime library:

from datetime import date

today = date.today()
month_start = today.replace(day=1)

and you can replace day, month, year, etc...

erickfis
  • 1,074
  • 13
  • 19
-1

try this Pandas libraries, where 'purchase_date' is date parameter placed into the module.

date['month_start'] = pd.to_datetime(sched_slim.purchase_date)
.dt.to_period('M')
.dt.to_timestamp()
miro_muras
  • 33
  • 6