35

I have written a function to convert pandas datetime dates to month-end:

import pandas
import numpy
import datetime
from pandas.tseries.offsets import Day, MonthEnd

def get_month_end(d):
    # Step back one day first: 31/March + MonthEnd() returns 30/April
    month_end = d - Day() + MonthEnd()
    if month_end.month == d.month:
        return month_end
    # Format the dates into the message; concatenating str + Timestamp fails
    raise ValueError("Something went wrong while converting dates to EOM: "
                     "{} was converted to {}".format(d, month_end))

This function seems to be quite slow, and I was wondering whether there is a faster alternative. I noticed it because I am running it on a dataframe column with 50,000 dates, and the code has been much slower since I introduced that function (compared with before I was converting dates to end-of-month).

df = pandas.read_csv(inpath, na_values = nas, converters = {open_date: read_as_date})
df[open_date] = df[open_date].apply(get_month_end)

I am not sure if that's relevant, but I am reading the dates in as follows:

def read_as_date(x):
    return datetime.datetime.strptime(x, fmt)
Anne

6 Answers

57

Revised: converting to a period and then back to a timestamp does the trick.

In [104]: df = pd.DataFrame(dict(date=[pd.Timestamp('20130101'), pd.Timestamp('20130131'), pd.Timestamp('20130331'), pd.Timestamp('20130330')], value=np.random.randn(4))).set_index('date')

In [105]: df
Out[105]: 
               value
date                
2013-01-01 -0.346980
2013-01-31  1.954909
2013-03-31 -0.505037
2013-03-30  2.545073

In [106]: df.index = df.index.to_period('M').to_timestamp('M')

In [107]: df
Out[107]: 
               value
2013-01-31 -0.346980
2013-01-31  1.954909
2013-03-31 -0.505037
2013-03-31  2.545073

Note that this type of conversion can also be done by adding a month-end offset, as follows, although the approach above is slightly faster.

In [85]: df.index + pd.offsets.MonthEnd(0) 
Out[85]: DatetimeIndex(['2013-01-31', '2013-01-31', '2013-03-31', '2013-03-31'], dtype='datetime64[ns]', name=u'date', freq=None, tz=None)
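
As a sanity check, the sketch below compares both vectorized forms against the row-wise function from the question on a column rather than an index. The Series name `dates`, the random sample, and its size are assumptions made up for illustration, and the question's error check is omitted for brevity.

```python
import numpy as np
import pandas as pd
from pandas.tseries.offsets import Day, MonthEnd

def get_month_end(d):
    # Row-wise logic from the question (error check omitted for brevity).
    return d - Day() + MonthEnd()

# Hypothetical sample: 1,000 random days over roughly three years.
rng = np.random.default_rng(0)
day_offsets = rng.integers(0, 3 * 365, size=1_000)
dates = pd.Series(pd.Timestamp("2013-01-01") + pd.to_timedelta(day_offsets, unit="D"))

row_wise = dates.apply(get_month_end)                      # one Python call per row
via_period = dates.dt.to_period("M").dt.to_timestamp("M")  # period round trip
via_offset = dates + pd.offsets.MonthEnd(0)                # roll forward to month end

print((row_wise == via_period).all(), (row_wise == via_offset).all())
```

If the three results agree on your own data, the `apply` call can be swapped for either vectorized expression.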
Jeff
  • This only works if there is 1 entry per month? I don't think all 50000 of Anne's entries are unique months. :) – Rutger Kassies Aug 14 '13 at 14:06
  • that's why ``resample`` is prob the right solution, unless she wants dups – Jeff Aug 14 '13 at 14:06
  • Thanks! Why do you need the `.values` at the end of your re-indexing? – Anne Aug 14 '13 at 15:12
  • Is the output @Jeff has in line 85 still correct in v.17? When I run this, I get `DatetimeIndex(['2012-12-31', '2012-12-31', '2013-02-28', '2013-02-28'], dtype='datetime64[ns]', name=u'date', freq=None)` – itzy Dec 19 '15 at 02:07
  • It seems that the `Period` class has incredibly slow performance...I too have a 50k row data frame and something like `df[df.period == some_period]` takes ~14 seconds, while following the above instructions and converting period back to datetime takes ~0.3 seconds, like a 100x speed increase. What gives? – tyleha Jan 01 '16 at 21:59
  • Periods are object dtype so of course it doesn't have great performance – Jeff Jan 01 '16 at 22:01
  • Is it possible to get the last business day of the month? For pd.DataFrame resampling, the parameter is `"BM"`, but that does not seem to work here. Edit: Got it: `df.index.levels[0] + pd.offsets.BMonthEnd(0) ` – Zhubarb Aug 22 '18 at 09:30
6

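If the date column is in datetime format, adding `pd.offsets.MonthEnd(0)` rolls each date forward to the last day of its month. A date set to the first day of the month therefore becomes the end of that same month, and a date already at month-end is unchanged:

df['date1'] = df['date'] + pd.offsets.MonthEnd(0)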
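
To illustrate why the `0` matters, here is a small sketch (the Series `dates` is made up for the example) contrasting `MonthEnd(0)` with `MonthEnd(1)`. The latter pushes a date that is already a month-end into the following month, which is the 31/March to 30/April effect noted in the question.

```python
import pandas as pd

# Hypothetical sample: start, middle, and end of March 2013.
dates = pd.Series(pd.to_datetime(["2013-03-01", "2013-03-15", "2013-03-31"]))

rolled = dates + pd.offsets.MonthEnd(0)   # month-end dates stay where they are
shifted = dates + pd.offsets.MonthEnd(1)  # month-end dates jump to the next month end

print(rolled.dt.strftime("%Y-%m-%d").tolist())
# ['2013-03-31', '2013-03-31', '2013-03-31']
print(shifted.dt.strftime("%Y-%m-%d").tolist())
# ['2013-03-31', '2013-03-31', '2013-04-30']
```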
Dimanjan
3
import pandas as pd

# Parse the strings, drop any time-of-day component, then take each month's first day.
# (pd.datetools has been removed from pandas; .dt.normalize() does the same job.)
df0['Calendar day'] = pd.to_datetime(df0['Calendar day'], format='%m/%d/%Y')
df0['Calendar day'] = df0['Calendar day'].dt.normalize()
df0['Month Start Date'] = df0['Calendar day'].dt.to_period('M').dt.start_time

This code should work. `Calendar day` is a column in which the date is given in the format %m/%d/%Y; for example, 12/28/2014 is 28 December 2014. The output for that date is 2014-12-01, of type pandas Timestamp, that is, the first day of the month rather than the last.

1

You can also use numpy to do it faster (this truncates each date to the start of its month):

import numpy as np
date_array = np.array(['2013-01-01', '2013-01-15', '2013-01-30']).astype('datetime64[ns]')
month_start_date = date_array.astype('datetime64[M]')
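
The truncation above yields month starts. One way to get month ends with numpy alone, sketched here with made-up sample dates, is to step to the first day of the next month and subtract one day.

```python
import numpy as np

# Hypothetical sample dates.
date_array = np.array(["2013-01-01", "2013-02-15", "2013-03-31"]).astype("datetime64[ns]")

# Truncate to the month, move one month ahead, convert back to day precision
# (the first day of the next month), then step back one day.
month_start = date_array.astype("datetime64[M]")
next_month_start = (month_start + np.timedelta64(1, "M")).astype("datetime64[D]")
month_end = next_month_start - np.timedelta64(1, "D")

print(month_end)
# ['2013-01-31' '2013-02-28' '2013-03-31']
```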
Matias Thayer
1

In case the date is not in the index but in another column (works for Pandas 0.25.0):

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(date = [pd.Timestamp('20130101'), 
                               pd.Timestamp('20130201'), 
                               pd.Timestamp('20130301'), 
                               pd.Timestamp('20130401')], 
                       value = np.random.rand(4)))
print(df.to_string())

df.date = df.date.dt.to_period('M').dt.to_timestamp('M')
print(df.to_string())

Output:

    date     value
0 2013-01-01  0.295791
1 2013-02-01  0.278883
2 2013-03-01  0.708943
3 2013-04-01  0.483467

        date     value
0 2013-01-31  0.295791
1 2013-02-28  0.278883
2 2013-03-31  0.708943
3 2013-04-30  0.483467
Tony
0

What you are looking for might be:

df.resample('M').last()

The other method, as mentioned earlier by @Jeff, relabels the index without aggregating:

df.index = df.index.to_period('M').to_timestamp('M')
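
The two methods differ in what they return, which the following sketch tries to make concrete. The frame and its values are invented for illustration, and `pd.offsets.MonthEnd()` is passed to `resample` instead of the `'M'` string on the assumption that an offset object is accepted regardless of which alias strings a given pandas version prefers.

```python
import pandas as pd

# Hypothetical data: two dates in January, two in March, none in February.
idx = pd.to_datetime(["2013-01-01", "2013-01-31", "2013-03-30", "2013-03-31"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Relabelling keeps every row; month-end labels are simply duplicated.
relabelled = df.copy()
relabelled.index = df.index.to_period("M").to_timestamp("M")

# Resampling aggregates to one row per month end (here, the last value seen).
monthly = df.resample(pd.offsets.MonthEnd()).last()

print(len(relabelled), len(monthly))
```

So `resample` is the better fit when one value per month is wanted, while the period round trip suits the case where every original row should keep its own month-end date.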