0

Can someone please help me understand how I can count people whose status changed from one specific value to another specific value in the next month in Python?

My goal is to count people whose status is currently "Inactive".
Further, of those people, I only want to count those whose status was "Active" in the month prior to it changing to "Inactive".
I do not want want to include people who are currently Inactive but whose status was something other than Active in the month prior to it changing to Inactive.

I've attempted to share an example below.
In this example, I would want to count Alison but not Bill.

Status by month example data

|--------|-----------|----------|    |--------|-----------|-----------|
| Person | Month     | Status   |    | Person | Month     | Status    |
|--------|-----------|----------|    |--------|-----------|-----------|
| Alison | Jan 2023  | Active   |    | Bill   | Jan 2023  | Active    |
| Alison | Feb 2023  | Active   |    | Bill   | Feb 2023  | Active    |
| Alison | Mar 2023  | Active   |    | Bill   | Mar 2023  | Active    |
| Alison | Apr 2023  | Active   |    | Bill   | Apr 2023  | Transient |
| Alison | May 2023  | Inactive |    | Bill   | May 2023  | Inactive  |
| Alison | June 2023 | Inactive |    | Bill   | June 2023 | Inactive  |
| Alison | Jul 2023  | Inactive |    | Bill   | Jul 2023  | Inactive  |
|--------|-----------|----------|    |--------|-----------|-----------|

I haven't figured out a way to attempt to solve this yet. I've Googled quite a bit but haven't found this type of scenario addressed (I'm likely not using correct terms).

moken
  • 3,227
  • 8
  • 13
  • 23
  • can you provide the example of output table? – biyazelnut May 05 '23 at 01:57
  • Please provide a self-contained example that reproduces your problem (see [How to create a Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/14311263)). – Timus May 05 '23 at 07:32
  • For example, it is unclear if you have several dataframes - looks a bit like it - or all data is contained in one? – Timus May 05 '23 at 09:19
  • Thank you both for your quick help, and I apologize for being unclear in my original post. Yes, all data are in one dataframe. My desired output table would be a count of people whose status was "Inactive" as of Jul 2023, whose status was "Active" in the month before their status changed to "Inactive" (whenever that month was). There could be other values in the "Status" column besides "Active" and "Inactive". I only want to count people if their status was "Active" before it changed to "Inactive". I hope this helps. Thank you again. – Alison Gass May 05 '23 at 13:41

2 Answers2

0

You could:

  • locate the Inactive rows
  • subtract 1 month
  • .merge to find rows from the previous month
(
   df.loc[df["Status"] == "Inactive"]
     .assign(Month = df["Month"] - pd.DateOffset(months=1))
     .drop(columns="Status")
     .merge(df, how="left")
#     .loc[
#        lambda df: ~(df["Status"] != "Active")
#      ]
)
   Person      Month     Status
0  Alison 2023-04-01     Active
1  Alison 2023-05-01   Inactive
2  Alison 2023-06-01   Inactive
3    Bill 2023-04-01  Transient
4    Bill 2023-05-01   Inactive
5    Bill 2023-06-01   Inactive

With the .loc uncommented, you're left with:

   Person      Month  Status
0  Alison 2023-04-01  Active

It may need further processing depending if there can be multiple entries per month, etc.

df used:

import io
import pandas as pd

df = pd.read_csv(io.StringIO("""
Person,Month,Status
Alison,Jan 2023,Active
Alison,Feb 2023,Active
Alison,Mar 2023,Active
Alison,Apr 2023,Active
Alison,May 2023,Inactive
Alison,June 2023,Inactive
Alison,Jul 2023,Inactive
Bill,Jan 2023,Active
Bill,Feb 2023,Active
Bill,Mar 2023,Active
Bill,Apr 2023,Transient
Bill,May 2023,Inactive
Bill,June 2023,Inactive
Bill,Jul 2023,Inactive
""".strip()), parse_dates=["Month"])
jqurious
  • 9,953
  • 1
  • 4
  • 14
0

Many thanks to jqurious for their help. With their guidance and some additional searching on this site, I think I have code (shared below) that accomplishes my goal of counting people whose status changed from "Active" to "Inactive" at some point in time and exclude people from this count whose status changed in some other way (e.g., "Active" to "Transient"). This code is likely clunky but it worked so I'll take it! Just a note that the "Month" column in my original example above to "Status Month".

enter code here
#created a dataframe "df" that reads in csv file containing record listing of Person ID, Stauts Month, Status, and Person Count 

df = pd.read_csv(r'C:\Documents\Test file.csv')

#convert Status Month in "df" dataframe to date-time field
df['Status Month'] = df['Status Month'].astype('datetime64[ns]')

#create a new dataframe "df2" based on the original "df" limits to rows where Status = Inactive
df2 = df.loc[df['Status'] == 'Inactive']

#create a new dataframe "df3" based on the newly-created "df2" that substracts one month from the Status Month value associated with rows where Status = Inactive
#assign this calculation to a new field called "Month"
df3= df2.assign(Month = df['Status Month'] - pd.DateOffset(months=1))

#create a series "df4" based on the newly-created "df3" that returns the minimum "Month" value for each person
#"Month" in "df4" will now represent the month *before* the first row where Status = Inactive 
df4 = df3.groupby('Person ID')['Month'].min()

#create a new dataframe "df5" to convert the series object "df4" to a dataframe that can be further manipulated
df5 = df4.to_frame().reset_index()

#create a new dataframe "df6" based on the newly-created "df5" that merges "df5" together with the original "df" - merging on the combination of Person ID and Month values from "df5"
#pull in Status values from "df" from the month that matches the "Month" value for each person in "df5" 
df6 = pd.merge(df5, df,  how='left', left_on=['Person ID','Month'], right_on=['Person ID','Status Month'])

#count the number of rows from "df6" where Status = Active
print(sum(df6['Status'] == 'Active'))