0

I've got a dataframe with dates, and I want to select the highest date in each week excluding weekends (so Fridays, if available), unless there is no Monday to Friday data and Saturday/Sunday are the only one's available.

The sample data can be setup like this:

dates = pd.Series(data=['2018-11-05', '2018-11-06', '2018-11-07', '2018-11-08', '2018-11-09',
                        '2018-11-12', '2018-11-13', '2018-11-14', '2018-11-15', '2018-11-17',
                        '2018-11-19',
                        '2018-12-01',
                        ])
nums = np.random.randint(50, 100, 12)
# nums
# array([95, 80, 81, 51, 98, 62, 50, 55, 59, 77, 69])

df = pd.DataFrame(data={'dates': dates, 'nums': nums})
df['dates'] = pd.to_datetime(df['dates'])

The records I want:

  • 2018-11-09 is Friday
  • 2018-11-15 is Thursday (not 2018-11-17 coz it's Saturday)
  • 2018-11-19 is Monday and the only record for that week
  • 2018-12-01 is Saturday but the only record for that week

My current solution is in the answer below but I don't think it's ideal and has some issues I had to work around. Briefly, it's:

  1. groupby week: df.groupby(df['dates'].dt.week).apply(some_function)
  2. if there's just one record for that week, return it
  3. otherwise, select the highest/latest record with day <= Friday and return that

Ideally, I'd like a way to write:

[latest Mon-Fri record] if [has Mon-Fri record] else [latest Sat-Sun record]
aneroid
  • 12,983
  • 3
  • 36
  • 66

2 Answers2

1

Create a new hierarchy of weekdays, where Saturday and Sunday are given the lowest priority. Then sort_values on this new ranking + groupby + .tail(1).

import numpy as np

wd_map = dict(zip(np.arange(0,7,1), np.roll(np.arange(0,7,1),-2)))
# {0: 2, 1: 3, 2: 4, 3: 5, 4: 6, 5: 0, 6: 1}
df = df.assign(day_mapped = df.dates.dt.weekday.map(wd_map)).sort_values('day_mapped')

df.groupby(df.dates.dt.week).tail(1).sort_index()

Output

        dates  nums  day_mapped
4  2018-11-09    57           6
8  2018-11-15    83           5
10 2018-11-19    96           2
11 2018-12-01    66           0

If your data span multiple years, you'll need to group on both Year + week.

ALollz
  • 57,915
  • 7
  • 66
  • 89
  • 1
    Thanks! This is much more succinct than what I had, and significantly less 'hand-written' logic and surely much much faster. – aneroid Nov 27 '18 at 22:25
0

I wrote a function to select the valid highest record for the week, which would need to be used on a weekly groupby:

def last_valid_report(recs):
    if len(recs) == 1:
        return recs
    recs = recs.copy()
    # recs = recs[recs['dates'].dt.weekday <= 4].nlargest(1, recs['dates'].dt.weekday)  # doesn't work
    recs['weekday'] = recs['dates'].dt.weekday  # because nlargest() needs a column name
    recs = recs[recs['weekday'] <= 4].nlargest(1, 'weekday')
    del recs['weekday']
    return recs
    # could have also done:
    # return recs[recs['weekday'] <= 4].nlargest(1, 'weekday').drop('weekday', axis=1)

Calling that with the correct groups, I get:

In [155]: df2 = df.groupby(df['dates'].dt.week).apply(last_valid_report)

In [156]: df2
Out[156]:
              dates  nums
dates
45    4  2018-11-09    63
46    8  2018-11-15    90
47    10 2018-11-19    80
48    11 2018-12-01    94

Couple of issues with this:

  1. If I don't put the recs.copy(), I get ValueError: Shape of passed values is (3, 12), indices imply (3, 4)

  2. pandas' nlargest will only use column names, not an expression.

    • so I need to create an extra column in the function and drop/delete it before returning it. I could also create this in the original df and drop it after the .apply().
  3. I'm getting an extra index-column 'dates', from the groupby+apply and needs to be explicitly dropped:

    In [157]: df2.index = df2.index.droplevel(); df2
    Out[157]:
            dates  nums
    4  2018-11-09    63
    8  2018-11-15    90
    10 2018-11-19    80
    11 2018-12-01    94
    
  4. If I get a record with Saturday and Sunday data (2 days), I need to add a check if recs[recs['weekday'] <= 4] is empty and then just use .nlargest(1, 'weekday') without filtering out weekday <= 4; but that's besides the point of the question.

aneroid
  • 12,983
  • 3
  • 36
  • 66