2

I have a pandas dataframe that has 2 columns named date_x and date_y. I want to pull what season date_x is in (winter, spring, summer, or fall) in a separate column and then determine if a holiday falls with the range of dates exclusive of date_y. I have seen some potential solutions around that have done seasons by breaking down the year into days but I can't seem to make that work for me. I am still figuring out what pandas can do. Another problem I am faced with is how to account for holidays like Easter that change every year. Any help is truly appreciated.

I have also played around with trying to make this solution work for me but can't figure out how to adapt it to create a new column that works off of date_x instead of today's date

Below is an image of what I am striving for.

Date_x, Date_y, what season Date_x is in, What holiday, if any, is within the range of dates, exclusive of Date_y

Larry Flores
  • 83
  • 1
  • 7
  • The hardest part of this solution is the `Holiday` column output, but I think there is a specific library for that for US Holidays: https://www.geeks forgeeks.org/python-holidays-library/ – David Erickson Aug 22 '20 at 01:01
  • Yeah, that was my feeling too. I had no idea that there was a holiday python library though so that might come in handy, I still have trouble getting functions to work with pandas columns though. I don't know if that makes sense. – Larry Flores Aug 22 '20 at 01:09

3 Answers3

2

I believe you're looking for something like below. Notice that part of this solution uses the answer provided here.

You may have to adjust to account for your exclusive requirement, but this will give you the idea.

import pandas as pd
import numpy as np

import holidays # pip install holidays

from datetime import date, datetime

us_holidays = holidays.UnitedStates()

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('winter', (date(Y,  1,  1),  date(Y,  3, 20))),
           ('spring', (date(Y,  3, 21),  date(Y,  6, 20))),
           ('summer', (date(Y,  6, 21),  date(Y,  9, 22))),
           ('autumn', (date(Y,  9, 23),  date(Y, 12, 20))),
           ('winter', (date(Y, 12, 21),  date(Y, 12, 31)))]

def get_season(dt):
    if isinstance(dt, datetime):
        dt = dt.date()
    dt = dt.replace(year=Y)
    return next(season for season, (start, end) in seasons
                if start <= dt <= end)

def get_holiday(data):
    return ",".join([us_holidays.get(x) for x in us_holidays[data["Date_X"] : data["Date_Y"]]])

np.random.seed(0)
rng_x = pd.date_range('2020-12-22', periods=365, freq='D')
rng_y = pd.date_range('2020-12-26', periods=365, freq='D')

df = pd.DataFrame({ 'Date_X': rng_x, 'Season_X': "", 'Date_Y': rng_y, 'Season_Y': ""}) 
print(df.head())

df['Season_X'] = df["Date_X"].apply(get_season)
df['Season_Y'] = df["Date_Y"].apply(get_season)
print(df.head())

df['Holiday'] = df.apply(get_holiday, axis=1)
print(df.head())

Output:

      Date_X Season_X     Date_Y Season_Y        Holiday
0 2020-12-22   winter 2020-12-26   winter  Christmas Day
1 2020-12-23   winter 2020-12-27   winter  Christmas Day
2 2020-12-24   winter 2020-12-28   winter  Christmas Day
3 2020-12-25   winter 2020-12-29   winter  Christmas Day
4 2020-12-26   winter 2020-12-30   winter
kerasbaz
  • 1,774
  • 1
  • 6
  • 15
1

Convert your columns to datetime type (if they aren't already) and construct two new empty columns:

df.Date_x = pd.to_datetime(df.Date_x)
df.Date_y = pd.to_datetime(df.Date_y)

Adding the Season column is easy once you notice the quarter attribute of datetime Series:

df['Season'] = df.Date_x.dt.quarter.map({1: 'Spring', 2: 'Summer', 3: 'Fall', 4: 'Winter'})

Holidays are obviously a bit more involved. Start with an empty column:

df['Holiday'] = ''

Easter is a special one, let's take care of that first. Pandas has it built in, though you can define your own Holiday rules too.

easter = df.Date_y - pd.tseries.offsets.Easter()

That gives you the Easter before Date_y (I could have used Date_x + Easter() but you said you wanted it exclusive of Date_y). Then:

df.loc[df.Date_x < easter, 'Holiday'] = 'Easter'

That puts "Easter" in the rows where the holiday is within [Date_x, Date_y).

I leave it as an exercise for you to create the holiday rules for the rest of your holidays. Here's some advice on that: Pandas Time Series Holiday Rule Offset

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
  • Thanks for the heads up on the Holiday Offset, which will come in handy once I learn more about it. Thanks for taking the time to answer my question – Larry Flores Aug 22 '20 at 08:09
1

Let's say that you have a dataframe with dates that not only range multiple holidays, but they range multiple seasons:

        Date_x      Date_y
0   2020-12-22  2021-01-01
1   2020-06-20  2020-07-11
3   2020-02-11  2020-03-27
4   2020-05-22  2020-06-27

In order to get 1. Season and 2. Holiday:

  1. I built off the link you shared to customize seasons
  2. I tried to avoid "third-party" libraries and chose to use the USFederalHolidayCalendar from the pandas holiday library; because, I thought that would be more reliable; however, I do not have much experieince with holiday libraries. Also, there are multiple calendars that could be used from the pandas library. From there, I used the get_season(x) and get_holiday() function that I created. For the former, I would reference the link in your question, and the latter uses list comprehension to pull in holidays into your dataframe from the holidays dataframe I created.

from pandas.tseries.holiday import USFederalHolidayCalendar
from datetime import datetime
import pandas as pd

cal = USFederalHolidayCalendar()
holidays = (pd.DataFrame(cal.holidays(return_name=True), columns=['Holiday'])
            .reset_index()
            .rename({'index': 'Date'}, axis=1))
holidays['Date'] = pd.to_datetime(holidays['Date'])
df= pd.DataFrame({'Date_x': {0: '2020-12-22', 1: '2020-06-20', 2: '2020-02-11', 3: '2020-05-22'},
                  'Date_y': {0: '2021-01-01', 1: '2020-07-11', 2: '2020-03-27', 3: '2020-06-27'}})
df['Date_x'] = pd.to_datetime(df['Date_x'])
df['Date_y'] = pd.to_datetime(df['Date_y'])

Y = 2000 # dummy leap year to allow input X-02-29 (leap day)
seasons = [('Winter', (date(Y,  1,  1),  date(Y,  3, 20))),
           ('Spring', (date(Y,  3, 21),  date(Y,  6, 20))),
           ('Summer', (date(Y,  6, 21),  date(Y,  9, 22))),
           ('Fall', (date(Y,  9, 23),  date(Y, 12, 20))),
           ('Winter', (date(Y, 12, 21),  date(Y, 12, 31)))]

def get_season(x):
    x = x.replace(year=Y)
    return next(season for season, (start, end) in seasons
                if start <= x <= end)


def get_holiday():
    return pd.DataFrame([(h,y,z) for (h,d) in zip(holidays['Holiday'], holidays['Date'])
     for (y, z) in zip(df['Date_x'], df['Date_y']) if y.date() <= d.date() if d.date() <= z.date()], columns=['Holiday', 'Date_x', 'Date_y'])


s1 = df['Date_x'].apply(lambda x: get_season(x))
s2 = df['Date_y'].apply(lambda x: get_season(x))
df['Season']= [', '.join(list(set([x,y]))) for (x,y) in zip(s1,s2)]
dft = get_holiday()
dft = dft.groupby(['Date_x', 'Date_y'])['Holiday'].apply(lambda x: ', '.join(list(x)))
df = pd.merge(df, dft, how='left', on=['Date_x', 'Date_y'])
df

Out[32]: 
      Date_x     Date_y            Season                     Holiday
0 2020-12-22 2021-01-01            Winter    Christmas, New Years Day
1 2020-06-20 2020-07-11    Summer, Spring                    July 4th
2 2020-02-11 2020-03-27    Spring, Winter              Presidents Day
3 2020-05-22 2020-06-27    Summer, Spring                Memorial Day
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • This is awesome and exactly what I am trying to do! Although when I replace the dataframe you made with `df = pd.read_csv('dates.csv') i get a StopIteration error on line 23 here: 23 return next(season for season, (start, end) in seasons – Larry Flores Aug 22 '20 at 07:55
  • Would NULL values in the data cause this? – Larry Flores Aug 22 '20 at 07:57
  • 1
    @LarryFlores probably try removing those rows to test it out with `df = df[(df['Date_x'].notnull()) & (df['Date_y'].notnull())` – David Erickson Aug 22 '20 at 07:59
  • 1
    Yep, that was the issue, which was actually the main reason I started this thread because I kept getting the same error thinking I was doing something wrong. So when i saw that I got the same error with your sample I figured it maybe was just the NULL values. Either way, thank you so much for all your time, effort, and help, I truly appreciate it more than you know. This has been bugging me for some time. – Larry Flores Aug 22 '20 at 08:05
  • 1
    No problem @LarryFlores NULLS and Duplicates are the root of much evil that is out there in the data world... I have learned this lesson MANY times, so that's a key thing to look for when cleaning data for future. – David Erickson Aug 22 '20 at 08:15