One way is to add a column for the year, month and day:
df['year'] = df.SomeDatetimeColumn.map(lambda x: x.year)
df['month'] = df.SomeDatetimeColumn.map(lambda x: x.month)
df['day'] = df.SomeDatetimeColumn.map(lambda x: x.day)
Then group by the year and month, order by day, and take only the first entry (which will be the minimum day entry).
df.groupby(
['year', 'month']
).apply(lambda x: x.sort_values('day', ascending=True).head(1))
The use of the lambda expressions makes this less than ideal for large data sets. You may not wish to grow the size of the data by keeping separately stored year, month, and day values. However, for these kinds of ad hoc date alignment problems, sooner or later having these values separated is very helpful.
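As a rough sketch of how the lambdas could be avoided, the same columns can be built with the vectorized `.dt` accessor, and the per-group sort can be replaced by one global sort followed by `groupby(...).head(1)`. The sample frame and its `value` column are made up for illustration; only `SomeDatetimeColumn` comes from the snippet above, and it is assumed to already have a datetime dtype.

```python
import pandas as pd

# Hypothetical sample data; only the column name SomeDatetimeColumn
# is taken from the snippet above.
df = pd.DataFrame({
    "SomeDatetimeColumn": pd.to_datetime(
        ["2020-01-15", "2020-01-03", "2020-02-20", "2020-02-07", "2020-02-11"]
    ),
    "value": [10, 20, 30, 40, 50],
})

# Vectorized equivalents of the three map(lambda ...) calls.
df["year"] = df["SomeDatetimeColumn"].dt.year
df["month"] = df["SomeDatetimeColumn"].dt.month
df["day"] = df["SomeDatetimeColumn"].dt.day

# Sort once by day, then keep the first row of each (year, month) group.
first_per_month = (
    df.sort_values("day")
      .groupby(["year", "month"])
      .head(1)
      .sort_values(["year", "month"])
)
print(first_per_month)
```

Unlike the `apply` version, `head(1)` keeps the original row labels rather than adding the group keys to the index.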
Another approach is to group directly by a function of the datetime column:
# dfrm is the DataFrame and dt is its datetime column
dfrm.groupby(
by=dfrm.dt.map(lambda x: (x.year, x.month))
).apply(lambda x: x.sort_values('dt', ascending=True).head(1))
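A possible lambda-free variant of this second approach, sketched under the assumption that `dt` is a datetime column: group by a monthly period key and use `idxmin` to find the label of the earliest row in each month. The sample data and the `value` column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with a datetime column named dt, as above.
dfrm = pd.DataFrame({
    "dt": pd.to_datetime(
        ["2020-01-15", "2020-01-03", "2020-02-20", "2020-02-07"]
    ),
    "value": [10, 20, 30, 40],
})

# One key per calendar month (a Period such as 2020-01).
month_key = dfrm["dt"].dt.to_period("M")

# Row label of the earliest timestamp within each month.
earliest_labels = dfrm.groupby(month_key)["dt"].idxmin()

# Select those rows from the original frame.
earliest_rows = dfrm.loc[earliest_labels]
print(earliest_rows)
```

This relies on the row labels being unique; with a duplicated index, `loc` would return more rows than intended.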
Normally these problems arise because of a dysfunctional database or data storage schema that exists one level prior to the Python/pandas layer.
For example, in this situation, it should be commonplace to rely on a calendar database table or calendar data set that contains (or makes it easy to query for) the earliest active date in a month relative to the given data set (such as the first trading day, the first weekday, the first business day, the first holiday, or whatever).
If a companion database table exists with this data, it should be easy to combine it with the dataset you already have loaded (say, by joining on the date column you already have) and then it's just a matter of applying a logical filter on the calendar data columns.
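To illustrate the join-and-filter idea, here is a minimal sketch. The calendar table, its `is_first_trading_day_of_month` flag, and the `price` column are all invented for the example; a real calendar table would have its own schema.

```python
import pandas as pd

# Hypothetical calendar table: one row per date with a precomputed flag.
calendar = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-02", "2020-01-03", "2020-02-03", "2020-02-04"]
    ),
    "is_first_trading_day_of_month": [True, False, True, False],
})

# Hypothetical data set that is already loaded, keyed by the same dates.
data = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-01-02", "2020-01-03", "2020-02-03", "2020-02-04"]
    ),
    "price": [100.0, 101.0, 105.0, 104.0],
})

# Join on the date column; validate guards against duplicate calendar dates.
merged = data.merge(calendar, on="date", how="inner", validate="many_to_one")

# The "first day of the month" question is now a plain boolean filter.
first_days = merged[merged["is_first_trading_day_of_month"]]
print(first_days)
```

The inner join drops any data rows whose date is missing from the calendar, which may or may not be what you want.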
This becomes especially important once you need to use date lags: for example, lining up a company's 1-month-ago market capitalization with the company's current-month stock return, to calculate a total return realized over that 1-month period.
This can be done by lagging the columns in pandas with shift, or by attempting a complicated self-join, which is likely to be very bug-prone and creates the problem of perpetuating the particular date convention to every place downstream that uses data from that code.
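For reference, a lag with `shift` might look like the sketch below. The panel, its column names, and the numbers are hypothetical, and it assumes exactly one row per company per month with no gaps.

```python
import pandas as pd

# Hypothetical monthly panel: one row per company per month, no gaps.
panel = pd.DataFrame({
    "company": ["A", "A", "A", "B", "B", "B"],
    "month": pd.PeriodIndex(["2020-01", "2020-02", "2020-03"] * 2, freq="M"),
    "market_cap": [100.0, 110.0, 121.0, 50.0, 55.0, 44.0],
    "stock_return": [0.05, 0.10, 0.10, 0.02, 0.10, -0.20],
})

# Sort so that shifting moves back exactly one month within each company.
panel = panel.sort_values(["company", "month"])

# Previous month's market cap, aligned with the current month's return.
panel["market_cap_lag1"] = panel.groupby("company")["market_cap"].shift(1)
print(panel)
```

Note that `shift(1)` lags by one row, not by one calendar month. If a company is missing a month, the lag silently picks up the wrong period, which is one reason this kind of code is fragile without normalized date data.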
It is much better to simply demand (or ensure yourself) that the data has properly normalized date features in its raw format (database, flat files, whatever). Stop what you are doing, fix that date problem first, and only then get back to carrying out some analysis with the date data.