Pandas shift on quarterly data with missing quarters

Question

I know there are some similar questions here, but please read on as I have looked and tried to adapt existing solutions without any luck. I have a data frame that pulls data for year and quarter. In the scenario shown below prevYearLeadCount displays data from Q1 2020. To be clear prevYearLeadCount will always display the lead count from the previous year in the same quarter. The below is just an example to show how the data is structured. Also, looking at the data below, since there is data for Q4 of 2019 I would expect 2020 Q4 prevYearLeadCount to equal 236

[
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2017,
        "quarter": 2,
        "leadCount": 151,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 2,
        "leadCount": 73,
        "prevYearLeadCount": 151.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 3,
        "leadCount": 271,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2018,
        "quarter": 4,
        "leadCount": 173,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 1,
        "leadCount": 209,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 2,
        "leadCount": 274,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 3,
        "leadCount": 311,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2019,
        "quarter": 4,
        "leadCount": 236,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 1,
        "leadCount": 245,
        "prevYearLeadCount": 209.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 2,
        "leadCount": 430,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 3,
        "leadCount": 907,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2020,
        "quarter": 4,
        "leadCount": 657,
        "prevYearLeadCount": 0.0
    },
    {
        "salesforceAccountId": 3148,
        "accountName": "Account Name",
        "year": 2021,
        "quarter": 1,
        "leadCount": 609,
        "prevYearLeadCount": 245.0
    }
]

Looking at the data above, I would expect for the year of 2020 to look like this:

{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 1,
    "leadCount": 209,
    "prevYearLeadCount": 209.0
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 2,
    "leadCount": 430,
    "prevYearLeadCount": 274
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 3,
    "leadCount": 907,
    "prevYearLeadCount": 311
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2020,
    "quarter": 4,
    "leadCount": 657,
    "prevYearLeadCount": 236 
},
{
    "salesforceAccountId": 3148,
    "accountName": "Account Name",
    "year": 2021,
    "quarter": 1,
    "leadCount": 609,
    "prevYearLeadCount": 245.0
}

As seen here I have tried the following:

df['prev_year_lead_count'] = df.groupby("quarter").lead_count.shift()[ (df.year == df.year.shift() + 1) ]

This is close as I get what I would expect in some cases, but not all. In some frames I see 0s where I should data definitely exist in the previous year and quarter. I am trying to do exactly what is seen here but each year is broken into quarters.

One more thing I have tried is combining python and pandas a bit. The idea here is to loop over existing years in the frame, and check the previous year to see if the quarter exist. If it does, do the pandas.

qs = [1, 2, 3, 4]
for year in leads_df["year"].unique():
    df = leads_df[leads_df["year"] == year - 1]
    for q in qs:
        if q in df["quarter"]:
            leads_df["prev_year_lead_count"] = leads_df.groupby("quarter")["lead_count"].shift(+1)
            leads_df["prev_year_cost"] = leads_df.groupby("quarter")["cost"].shift(+1)
            leads_df["prev_year_ga_spent"] = leads_df.groupby("quarter")["ga_spent"].shift(+1)
            leads_df["prev_year_fb_spent"] = leads_df.groupby("quarter")["fb_spent"].shift(+1)
            leads_df["prev_year_monthly_package_cost"] = leads_df.groupby("quarter")[
                "monthly_package_cost"
            ].shift(+1)
            leads_df["prev_year_cpl"] = leads_df.groupby("quarter")["cpl"].shift(+1)

The example df you posted won't work for the solution because the df only has one row. Shift calls a val from a previous or following row, and as such would need more than one row to work. — Boskosnitch, Mar 15 '21 at 17:34
Sorry, this was just an example of how the data is structured. — 00robinette, Mar 15 '21 at 18:33
No worries. But if you could provide some more thorough example data, it would be much easier to reproduce the issue and possibly provide a solution. — Boskosnitch, Mar 15 '21 at 18:42
There yah go! So a little more context - I will always pull 4 years of data. The problem is clients come and go. So I will always be missing a quarter here and there. — 00robinette, Mar 15 '21 at 18:54
Also, looking at the data above since there is data for Q4 of 2019 I would expect 2020 Q4 prevYearLeadCount to equal 236 — 00robinette, Mar 15 '21 at 19:04
Can you please share the expected output for the sample data you provided? — Joe Ferndz, Mar 15 '21 at 19:13
Sure thing! I added what I would expect for the year 2020 for brevity sake. If thats no good I can do for the series — 00robinette, Mar 15 '21 at 19:20
Try this. `df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift())`. It should solve it for you. — Joe Ferndz, Mar 15 '21 at 19:53

Joe Ferndz · Accepted Answer · 2021-03-16T06:18:38.177

fix to check for prev year only

To solve for one year difference, we need to trick groupby. Here's how to do it.

import pandas as pd

df = pd.DataFrame(d)

#find difference between years for each quarter

df['yeardiff'] = df.groupby(['quarter'])['year'].transform(lambda x: x - x.shift())

#create a condition to pick only NaN and difference of 1 year
#this will eliminate 2 years or more

cond = (df['yeardiff'].isnull() | (df['yeardiff'] == 1.0))

#use this condition while doing the groupby
#If condition not met, it will default to NaN

df['newprevYearLeadCount'] = df[cond].groupby(['quarter'])['leadCount'].transform(lambda x: x.shift())

print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])

Results of this looks like this:

I removed the entry for 2020 quarter 1. So 2021 quarter 1 should be NaN.

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        2        430                0.0                 274.0
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                   NaN  #prev year LeadCount ignored

Another example with 2019 Quarter 2 excluded:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        3        311                0.0                 271.0
6   2019        4        236                0.0                 173.0
7   2020        1        245              209.0                 209.0
8   2020        2        430                0.0                   NaN  #prev year LeadCount ignored
9   2020        3        907                0.0                 311.0
10  2020        4        657                0.0                 236.0
11  2021        1        609              245.0                 245.0

prev answer

You should be able to groupby(['quarter'], then do a shift() to get the results.

import pandas as pd
df = pd.DataFrame(d)
#df.sort_values(by=['quarter','year'],inplace=True)
#df.reset_index(drop=True,inplace=True)
df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift())
print (df[['year','quarter','leadCount','prevYearLeadCount', 'newprevYearLeadCount']])

The output of this will be:

    year  quarter  leadCount  prevYearLeadCount  newprevYearLeadCount
0   2017        2        151                0.0                   NaN
1   2018        2         73              151.0                 151.0
2   2018        3        271                0.0                   NaN
3   2018        4        173                0.0                   NaN
4   2019        1        209                0.0                   NaN
5   2019        2        274                0.0                  73.0
6   2019        3        311                0.0                 271.0
7   2019        4        236                0.0                 173.0
8   2020        1        245              209.0                 209.0
9   2020        2        430                0.0                 274.0
10  2020        3        907                0.0                 311.0
11  2020        4        657                0.0                 236.0
12  2021        1        609              245.0                 245.0

Initially I was going to sort_values by quarter then year but groupby takes care of it. So you need only groupby. transform takes care of assigning the value to each row.

If there are no records to pick the previous year Lead Count, then the value is set to NaN. You can decide to fillna(0) so it will get replaced with 0.0 instead.

If you need to have 0 instead of NaN, then do this:

df['newprevYearLeadCount'] = df.groupby(['quarter'])['leadCount'].transform(lambda x:x.shift()).fillna(0)

Any idea how to account for a missing year of data? If you have period 1 for 2019, and period 1 for 2021, you get 2019 data for previous year data. — 00robinette, Mar 16 '21 at 03:46
Hmmm. Yes, it will pick 2019 data as prevYear. Let me fix that. — Joe Ferndz, Mar 16 '21 at 03:52
Had to do some digging to get this right. @00robinette, see my updated answer. I think this should solve the issue — Joe Ferndz, Mar 16 '21 at 06:10

Pandas shift on quarterly data with missing quarters

1 Answers1

fix to check for prev year only

prev answer