36

I have a pandas DataFrame with index column = date.

Input:

            value
date    
1986-01-31  22.93
1986-02-28  15.46

I want to floor each date to the first day of its month.

Output:

            value
date    
1986-01-01  22.93
1986-02-01  15.46

What I tried:

df.index.floor('M')
ValueError: <MonthEnd> is a non-fixed frequency

This may be because df was generated by df = df.resample("M").sum() (the output of that call is the input shown at the beginning of the question).

I also tried df = df.resample("M", convention='start').sum(), but that does not work either.

I know in R, it is easy to just call floor(date, 'M').

FObersteiner
Jill Clover
  • Is performance an issue? I'd consider transforming the values into `datetime` objects for this, but this might be way too costly if you are trying to process millions of objects. – Alfe Feb 16 '17 at 21:47

11 Answers

42

There is a pandas issue about this floor problem.

The suggested approach is:

import pandas as pd
pd.to_datetime(df.date).dt.to_period('M').dt.to_timestamp()
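Since the question has the dates in the index rather than in a column, here is a minimal sketch of the same period round-trip applied to a DatetimeIndex. The sample frame is rebuilt from the question's input, and the name `floored` is only illustrative.

```python
import pandas as pd

# Rebuild the input from the question: month-end dates in the index.
df = pd.DataFrame(
    {"value": [22.93, 15.46]},
    index=pd.to_datetime(["1986-01-31", "1986-02-28"]).rename("date"),
)

# Convert to monthly periods, then back to timestamps.
# to_timestamp() defaults to the start of each period.
floored = df.copy()
floored.index = df.index.to_period("M").to_timestamp()

print(floored)
```

On an index the `.dt` accessor is not used; `to_period` and `to_timestamp` are called on the index directly.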
Deo Leung
  • `df.date.dt.to_period('M').dt.to_timestamp()` seems to be sufficient, the initial `pd.to_datetime` is not needed. – Zoltan Mar 15 '19 at 21:16
17

You can use timeseries offset MonthBegin

import pandas as pd
from pandas.tseries.offsets import MonthBegin
df['date'] = pd.to_datetime(df['date']) - MonthBegin(1)

Edit: The above solution does not handle the dates which are already floored to the beginning of the month. Here is an alternative solution.

Here is a dataframe with additional test cases:

            value
date    
1986-01-31  22.93
1986-02-28  15.46
2018-01-01  20.00
2018-02-02  25.00

With timedelta method,

df.index = pd.to_datetime(df.index)
df.index = df.index - pd.to_timedelta(df.index.day - 1, unit='d')


            value
date    
1986-01-01  22.93
1986-02-01  15.46
2018-01-01  20.00
2018-02-01  25.00
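To make the limitation of the MonthBegin subtraction concrete, here is a small sketch on two scalar timestamps taken from the test cases above. It also shows the offset's `rollback` method, which is not part of the original answer; it is included only as a possible alternative for single timestamps, and the variable names are illustrative.

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin

already_first = pd.Timestamp("2018-01-01")
mid_month = pd.Timestamp("2018-02-02")

# Subtracting the offset always moves strictly backwards, so a date
# that is already the 1st lands on the 1st of the previous month.
shifted_first = already_first - MonthBegin(1)
shifted_mid = mid_month - MonthBegin(1)

# rollback() only moves dates that are not already on the offset.
rolled_first = MonthBegin().rollback(already_first)
rolled_mid = MonthBegin().rollback(mid_month)

print(shifted_first, shifted_mid)
print(rolled_first, rolled_mid)
```

The timedelta method shown above remains the vectorized option for a whole index or column.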
Vaishali
  • 5
    this is the only pandonic approach among all the answers (as a bonus this is vectorized) – Jeff Feb 17 '17 at 18:47
  • 4
    There's a bug with this method: it'd translate any date to the beginning of the following month, except the beginning of the month, which stays the same. i.e. 1-1-2018 -> 1-1-2018, but 2-1-2018 -> 1-2-2018... – Mr.WorshipMe Jan 16 '18 at 09:48
  • The timedelta method is right-on, worked perfectly for my use case, with one modification - adding the 'dt', so changing to 'df.index.dt.day' inside the to_timedelta(). – rocksteady Nov 18 '18 at 23:53
11

This will do the trick, and no extra imports are necessary. NumPy has a datetime64 dtype, which pandas sets to [ns] by default, as you can see by checking the dtype. You can change the unit to month, which truncates each value to the first of the month, by accessing the underlying NumPy array and casting it.

df.date = pd.to_datetime(df.date.values.astype('datetime64[M]'))

It would be nice if pandas implemented this in its own astype() method, but unfortunately it does not.

The above works for data stored as datetime values or as strings. If your data already has the datetime64[ns] dtype, you can omit pd.to_datetime() and just do:

df.date = df.date.values.astype('datetime64[M]')
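A self-contained sketch of the NumPy cast may help here. The sample dates and the names `months` and `floored` are made up for illustration, and the resolution of the resulting dtype (for example seconds versus nanoseconds) may differ between pandas versions, so only the values are checked.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["1986-01-31", "1986-02-28", "2018-01-01"]))

# Cast the underlying NumPy array to month resolution; this truncates
# each value to the first day of its month.
months = dates.to_numpy().astype("datetime64[M]")

# Hand the result back to pandas as regular timestamps.
floored = pd.Series(pd.to_datetime(months), index=dates.index)

print(floored)
```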
Grr
  • 1
    I don’t know in which version they imported that to the pandas astype, but currently `df.date.astype('datetime64[M]')` works (version 1.2.2 at least). You could update this answer. – Cimbali Jun 25 '21 at 14:01
8

Here's another 'pandonic' way to do it:

df.date - pd.Timedelta('1 day') * (df.date.dt.day - 1)
aldanor
  • 1
    This is very nice! Plus, it works with dask! (in contrast to Deo Leung's answer) – srs Aug 28 '18 at 14:54
  • This also works well with variable instances of Timestamps in addition to being vectorized. Just change the `df.date` with your Timestamp and it works great! – cheevahagadog Mar 14 '19 at 19:46
8

You can also use string datetime formatting (note that the result is a column of strings, not datetimes):

df['month'] = df['date'].dt.strftime('%Y-%m-01')
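Because strftime returns text, a follow-up conversion is needed if real datetimes are wanted. The sketch below uses a made-up two-row frame, and `as_text` and `as_dates` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["1986-01-31", "2018-01-01"])})

# Formatting with a literal "01" for the day yields strings.
as_text = df["date"].dt.strftime("%Y-%m-01")

# Parse the strings back if a datetime column is needed.
as_dates = pd.to_datetime(as_text)

print(as_text.tolist())
print(as_dates)
```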

Mikhail Venkov
3

Assume that you are dealing with the following data frame:

import pandas as pd

df = pd.DataFrame({'MyDate': ['2021-03-11', '2021-04-26', '2021-01-17']})
df['MyDate'] = pd.to_datetime(df.MyDate)

Which is:

    MyDate
0   2021-03-11
1   2021-04-26
2   2021-01-17

And you want to truncate the date to month:

df['Truncated'] = df['MyDate'] + pd.offsets.MonthBegin(-1)
# OR    
# df['Truncated'] = df['MyDate'] - pd.offsets.MonthBegin(1)
df

And you get:

      MyDate  Truncated
0 2021-03-11 2021-03-01
1 2021-04-26 2021-04-01
2 2021-01-17 2021-01-01

Important note: this approach DOES NOT work when the date is already the first day of the month, because the date is then moved to the first day of the previous month. For that reason, here are other solutions too.

import pandas as pd

df = pd.DataFrame({'MyDate': ['2021-03-11', '2021-04-26', '2021-01-17', '2021-02-01']})
df['MyDate'] = pd.to_datetime(df.MyDate)

df['Truncated'] = df['MyDate'].dt.to_period('M').dt.to_timestamp()
print(df)

And you get:

      MyDate  Truncated
0 2021-03-11 2021-03-01
1 2021-04-26 2021-04-01
2 2021-01-17 2021-01-01
3 2021-02-01 2021-02-01

Finally, another approach could be the following:

df['Truncated'] = df['MyDate'].dt.strftime('%Y-%m-01')
print(df)

And you get:

      MyDate   Truncated
0 2021-03-11  2021-03-01
1 2021-04-26  2021-04-01
2 2021-01-17  2021-01-01
3 2021-02-01  2021-02-01
George Pipis
  • 1
    your code won't work if the date is the first day of the month - it would become the first day of the previous month – snowboi Jan 17 '22 at 21:19
2
dt_1 = "2016-02-01"
def first_day(dt):
    lt_split = dt.split("-")
    return "-".join([lt_split[0], lt_split[1], "01"])

print(first_day(dt_1))

For a pandas DataFrame, you can use df["col_name_date"].apply(first_day), provided the column holds strings in YYYY-MM-DD format.

garrettmurray
W.Li
1

From August 2019:

This should work:

[x.replace(day=1).date() for x in df['date']]

Only requirement is to make sure date is a datetime, which we can guarantee with a call to pd.to_datetime(df['date'])
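As a rough, self-contained illustration of the list comprehension, the sketch below first converts a made-up string column with pd.to_datetime and then stores the result in a new column. The column name `month_start` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"date": ["1986-01-31", "2018-01-01", "2018-02-02"]})
df["date"] = pd.to_datetime(df["date"])  # guarantee datetime values

# Each element is a Timestamp; replace(day=1) moves it to the 1st,
# and .date() turns it into a plain datetime.date object.
first_days = [x.replace(day=1).date() for x in df["date"]]

# Convert back to timestamps to keep a datetime column in the frame.
df["month_start"] = pd.to_datetime(first_days)

print(df)
```

This loops in Python, so it is likely slower than the vectorized answers on large frames.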

Community
Yuca
1

I liked Mikhail Venkov's answer. I added the code below to store the column as a timestamp value and keep the timezone information:

    df['month'] = pd.to_datetime(df['timestamp'].dt.strftime('%Y-%m-01')).dt.tz_localize(timezone) 

where timezone = 'America/Los_Angeles' or whatever zone you want
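Here is a minimal runnable sketch of that line, assuming the `timestamp` column is already expressed in the target zone. The sample values are made up. If the column were stored in a different zone (UTC, say), the formatted month would follow that zone's calendar, so it would probably need a tz_convert first.

```python
import pandas as pd

timezone = "America/Los_Angeles"

# Made-up sample data, localized to the target zone.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2021-03-11 08:30", "2021-04-26 17:00"]
        ).tz_localize(timezone)
    }
)

# strftime drops the zone, so it is re-attached after parsing.
df["month"] = pd.to_datetime(
    df["timestamp"].dt.strftime("%Y-%m-01")
).dt.tz_localize(timezone)

print(df)
```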

user2557522
0

One-liner:

df.set_index(
    df.index - pd.to_timedelta(pd.to_datetime(df.index).day - 1, unit="D")
)

(This emits a warning about vectorization.)

Actually what I would rather do is to first make that index a proper column, do the operations on it and then make it an index again:

In [32]: df = pd.DataFrame(
                index=[datetime.date.fromisoformat("1986-01-31"), datetime.date.fromisoformat("1986-02-28")], 
                data={"value":[22.93, 15.46]})

In [33]: df
Out[33]:
            value
1986-01-31  22.93
1986-02-28  15.46

In [34]: df2 = df.reset_index().rename(columns={"index": "date"})

In [35]: df2
Out[35]:
         date  value
0  1986-01-31  22.93
1  1986-02-28  15.46

In [36]: df2.date = pd.to_datetime(df2.date)

In [37]: df2.date
Out[37]:
0   1986-01-31
1   1986-02-28
Name: date, dtype: datetime64[ns]

In [38]: df2.date -= pd.to_timedelta(df2.date.dt.day - 1, unit="D")

In [39]: df2
Out[39]:
        date  value
0 1986-01-01  22.93
1 1986-02-01  15.46

In [40]: df2.set_index("date")
Out[40]: 
            value
date             
1986-01-01  22.93
1986-02-01  15.46

The most important line is:

df2.date -= pd.to_timedelta(df2.date.dt.day - 1, unit="D")

where you actually remove the day offset.

Raffi
0

I needed both floor and ceiling functions. The following worked for me:

import pandas as pd

def monthfloor(ts):
    offset = pd.offsets.Hour(1)

    if offset.is_month_start(ts):
        return ts
    else:
        return ts - pd.offsets.MonthBegin()

def monthceil(ts):
    offset = pd.offsets.Hour(1)

    if offset.is_month_end(ts):
        return ts
    else:
        return ts + pd.offsets.MonthEnd()

begin = pd.Timestamp(2023, 4, 1)
middle = pd.Timestamp(2023, 4, 15)
end = pd.Timestamp(2023, 4, 30)

tss = pd.Series([begin, middle, end])

print(tss.map(monthfloor))
# 0   2023-04-01
# 1   2023-04-01
# 2   2023-04-01
# dtype: datetime64[ns]

print(tss.map(monthceil))
# 0   2023-04-30
# 1   2023-04-30
# 2   2023-04-30
# dtype: datetime64[ns]
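For date-only data, a vectorized variant of the same floor and ceiling can be sketched with monthly periods instead of a per-element map. This is an alternative technique, not the functions above, and unlike them it discards any time-of-day component. The names `periods`, `floored` and `ceiled` are illustrative.

```python
import pandas as pd

tss = pd.Series(pd.to_datetime(["2023-04-01", "2023-04-15", "2023-04-30"]))

# One monthly period per row.
periods = tss.dt.to_period("M")

# Start of the period gives the month floor.
floored = periods.dt.to_timestamp()

# end_time is the last nanosecond of the month; normalize() resets it
# to midnight of the last day.
ceiled = periods.dt.end_time.dt.normalize()

print(floored)
print(ceiled)
```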
nmaclaren