Python DataFrame selecting the rows with monthly increment from daily data

Question

Let's get right into the question. The following is the daily data:

             AAA    BBB    CCC
date                           
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-04-18  44.92  28.74  17.72
2012-04-19  44.92  28.62  17.72
2012-04-20  45.09  28.68  17.71
2012-04-23  45.09  28.40  17.76
2012-04-24  45.09  28.51  17.73
2012-04-25  45.01  28.76  17.73
2012-04-26  45.40  28.94  17.76
2012-04-27  45.57  29.02  17.79
2012-04-30  45.45  28.90  17.80
2012-05-01  45.79  29.07  17.80
2012-05-02  45.71  28.98  17.77
2012-05-03  45.44  28.81  17.79
2012-05-04  45.05  28.48  17.79
2012-05-07  45.05  28.48  17.79
2012-05-08  45.00  28.40  17.93
2012-05-09  44.87  28.30  17.94
2012-05-10  44.93  28.34  17.85
2012-05-11  44.86  28.30  17.96
           ...    ...    ...

I want to select the rows starting from the first row with a monthly increment, that is, the rows whose index is 2012-04-16, 2012-05-16, 2012-06-16, ... . I can just use relativedelta and manually add them but I'm wondering if there is a more efficient method. I tried resampling, but I could only choose the first or last of each month as in df.resample('M').first().

What makes the problem more complicated is that some of the dates are missing; they are business days but not those of U.S.. There are several ways to handle this problem:

Choose the exact date or the earlier one closest to the date. If such date is nonexistent, then start looking up for the later dates.
Choose the exact date or the later one closest to the date. If such date is nonexistent, then start looking up for the earlier dates.
Choose the closest date to the exact date regardless of being early or late; I can use min(df.index, key=lambda x: abs(x - (df.index[0] + relativedelta(months=1))).

And in each of these cases, I wonder which method is the most efficient and easy to read. In the last code example, the month is a variable so I'm not sure if I can make it as a lambda procedure and use 'apply'.

Thanks in advance.

may be [this](http://earthpy.org/time_series_analysis_with_pandas_part_2.html) can help you. — Karn Kumar, Oct 04 '18 at 05:22

score 11 · Accepted Answer · edited Jun 20 '20 at 09:12

Before we look at your data, let's first see how we can create a DatetimeIndex for a specific day of each month. Since the regular pd.date_range with monthly frequency takes the last day of each month, we can simply add a fixed number of days:

idx = pd.date_range('2018-04-01', '2018-07-01', freq='1M') + pd.DateOffset(days=16)

DatetimeIndex(['2018-05-16', '2018-06-16', '2018-07-16'],
              dtype='datetime64[ns]', freq=None)

Now let's take an example dataframe which has some 16^th days missing:

              AAA    BBB    CCC
date                           
2012-04-16  44.48  28.48  17.65
2012-04-17  44.59  28.74  17.65
2012-05-15  45.79  29.07  17.80
2012-05-16  45.71  28.98  17.77
2012-05-17  45.44  28.81  17.79
2012-06-15  44.87  28.30  17.94
2012-06-17  44.95  28.50  17.98
2012-07-14  44.65  28.25  17.87
2012-07-17  44.55  28.75  17.75

As you mention, there are a number of ways you can decide on how to select non-matching days, either go backwards, forwards, or look for nearest with no preference. You need to consider what's most appropriate in the context of your project. Below is a solution which sticks to Pandas functionality and avoids custom lambda functions.

Define dataframe with `DatetimeIndex`

First create a dataframe with only required indices specified:

offset = pd.DateOffset(days=16)
start_date = df.index[0]-pd.DateOffset(months=1)
idx = pd.date_range(start_date, df.index[-1], freq='1M') + offset

df_idx = pd.DataFrame(index=idx)

Notice we need to subtract a month from the start argument, so that the first month is not omitted after adding 16 days. Now you can use pd.merge_asof with a variety of options:-

Match backwards / forwards / nearest via `merge_asof`

Specify direction argument as 'backward' (default), 'forward' or 'nearest' as appropriate. For example, using 'forward':

print(pd.merge_asof(df_idx, df, left_index=True, right_index=True, direction='forward'))

              AAA    BBB    CCC
2012-04-16  44.48  28.48  17.65
2012-05-16  45.71  28.98  17.77
2012-06-16  44.95  28.50  17.98
2012-07-16  44.55  28.75  17.75

This now may be sufficient for your needs.

Edit: If you want to keep the index from the dataframe, you can reverse the direction of the merge and use 'backward' instead of 'forward':

res = pd.merge_asof(df.reset_index(),
                    df_idx.reset_index().rename(columns={'index': 'date_idx'}),
                    left_on='date', right_on='date_idx', direction='backward')

res['diff'] = (res['date'] - res['date_idx']).dt.days.abs()
grouper = res['date'].dt.strftime('%Y-%m')
res = res[res['diff'] == res.groupby(grouper)['diff'].transform('min')]

print(res)

        date    AAA    BBB    CCC   date_idx  diff
0 2012-04-16  44.48  28.48  17.65 2012-04-16     0
3 2012-05-16  45.71  28.98  17.77 2012-05-16     0
6 2012-06-17  44.95  28.50  17.98 2012-06-16     1
8 2012-07-17  44.55  28.75  17.75 2012-07-16     1

Thank you so much @jpp Your answer helped me so much. I have some problems, though. (1) I wonder if the index can be chosen not as that of df_idx but that of the forwarded one (say, '2012-06-17' instead of '2012-06-16'). (2) If the offset days are set too late, the index will overflow into the next month so idx will have two indices with the same month. I wonder if the dateuitl module with relativedelta can be used to prevent this overflow. -- I've looked at the documentations of the functions you've used but I could't figure it out. A bit more help would be greatly appreciated! — Taxxi, Oct 12 '18 at 00:48
(1), Sure, a bit complicated but see update. (2) Just use a manual check for this, e.g. an `if` statement, it's a one-off task so won't be computationally expensive. (3) Avoid `dateutil` & `relativedelta` with Pandas. These libraries are designed for regular `datetime` objects, not for Pandas `pd.Timestamp` objects. Best not to mix the two. — jpp, Oct 12 '18 at 02:44
Your advise was very helpful. I guess I can construct idx by manually choosing the indices I want using a function and then apply your method. I'd be glad if you could tell me where I can learn pandas in an organized way; just reading the documents has been inefficient to me... — Taxxi, Oct 12 '18 at 03:42
@Taxxi, IMO, you can look at some of the tutorials [here](https://pandas.pydata.org/pandas-docs/stable/tutorials.html), but otherwise you may find learning by *trying things* and using SO when you are stuck is a useful process. — jpp, Oct 17 '18 at 11:05

Python DataFrame selecting the rows with monthly increment from daily data

1 Answers1

Define dataframe with `DatetimeIndex`

Match backwards / forwards / nearest via `merge_asof`

Linked

Python DataFrame selecting the rows with monthly increment from daily data

1 Answers1

Define dataframe with DatetimeIndex

Match backwards / forwards / nearest via merge_asof

Linked

Define dataframe with `DatetimeIndex`

Match backwards / forwards / nearest via `merge_asof`