0

I have the following dataframe with a tricky problem:

Disease  State       Month      Value
Covid    Texas     2020-03        2     
Covid    Texas     2020-04        3     
Covid    Texas     2020-05        4      
Covid    Texas     2020-08        3 
Cancer   Florida   2020-04        4     
Covid    Florida   2020-03        6      
Covid    Florida   2020-04        4      
Flu      Florida   2020-03        5         

I have to make a list of values for 3 consecutive months and create a new dataframe. However, there are some conditions:

  1. The list will be created for each disease, each month (from start to end: 2020 Feb - 2021 April) and each state.

  2. If any specific month is absent in the dataset, row for that month would be created and the value for that month would be 0.

Desired output:

Disease State    Month      ValueList
Covid   Texas    2020-02    [0, 2, 3] (no dataset for Feb 20 but next two months are) 
Covid   Texas    2020-03    [2, 3, 4] (has values for 3 consecutive months)
Covid   Texas    2020-04    [3, 4, 0] (doesn’t have value for 6th month)   
Covid   Texas    2020-05    [4, 0, 0] (has value for present month)
Covid   Texas    2020-06    [0, 0, 3] (has value for 8th month)
Covid   Texas    2020-07    [0, 3, 0] (has value for 8th month)
Covid   Texas    2020-08    [3, 0, 0] (has value for present month)
Covid   Texas    2020-09    [0, 0, 0] (no dataset for next 3 months)  
Covid   Texas    2020-10    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-11    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2020-12    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-01    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-02    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-03    [0, 0, 0] (no dataset for next 3 months)
Covid   Texas    2021-04    [0, 0, 0] (no dataset for next 3 months)

I am trying to fill in dates using this:

df3= (df2.set_index('MonthEnd')
   .groupby(['Disease', 'State']).apply(lambda x: x.drop(['Disease', 'State'], axis=1).asfreq('D'))
   .reset_index())

However, it doesn't returns the same time frame for each group. It returns the values between the min and max date in that group.

I’m not sure how I should start. Any help would be appreciated. Thanks!

Roy
  • 924
  • 1
  • 6
  • 17
  • check out groupby() and this https://stackoverflow.com/questions/19324453/add-missing-dates-to-pandas-dataframe to fill in missing dates. you can use apply() with groupby() to do add missing dates for each group. once grouped and added dates, you have to iterate and choose every three rows: use something like df.Value.tolist() for your Valuelist columns – Jonathan Leon May 09 '21 at 02:05
  • @JonathanLeon: Thanks for the reply. Would you be able to share other examples too. – Roy May 09 '21 at 02:42
  • 1
    unfortunately not at this time. you have a lot of individual asks within the question. start by searching groupby and apply to learn how to iterate and apply functions. I would suggest trying on your own and breaking this up with questions on each part of the process showing where you've tried. Folks are much more apt to help modify code than just provide it. – Jonathan Leon May 09 '21 at 02:55
  • I have added the logic. There might be a better solution to the one I have provided but the logic will remain the same. – Pygirl May 09 '21 at 08:39

2 Answers2

0

Let's start with simple logic. So basically you want to create date range from Feb 2020 to Apr 2021 for each group.

let's take each group and add this date range using reindex. Once I am done with adding the date range now I will fill the data and then will perform rolling function to get the 3 consecutive values(considering previous one and current one) and convert it into the list.

I will assign these list of list values to my ValueList column. Then I will add all these modified groups to dataframe.

Solution:

df.Month = pd.to_datetime(df.Month, format="%Y-%m")
df.set_index('Month',inplace=True)

def add_elem(li): # this is to add 0 elements if rolling function is not getting 2 previous rows. 
    n = (3-len(li))
    if n<3:
        li = [0]*n +li
    return li


start = '2020-02'
end = '2021-04'

data = pd.DataFrame()
for i,grp in df.groupby(['Disease', 'State']):
    grp = (grp.reindex(pd.date_range(start=start, end=end, freq="MS")))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = (grp.fillna(0))
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = ([add_elem(window.to_list()) for window in grp['Value'].rolling(3)])
    data = data.append(grp)

OR

using apply:

def fill_date(grp):
    grp = (grp.reindex(pd.date_range(start=start, end=end, freq="MS")))
    grp[['Disease', 'State']] = grp[['Disease', 'State']].bfill().ffill()
    grp = (grp.fillna(0))
    grp['Value'] = grp['Value'].astype(int)
    grp['ValueList'] = ([add_elem(window.to_list()) for window in grp['Value'].rolling(3)])
    return grp

 data = df.groupby(['Disease', 'State'], as_index=False).apply(fill_date)

data:

Disease State Value ValueList
2020-02-01 Cancer Florida 0 [0, 0, 0]
2020-03-01 Cancer Florida 0 [0, 0, 0]
2020-04-01 Cancer Florida 4 [0, 0, 4]
2020-05-01 Cancer Florida 0 [0, 4, 0]
2020-06-01 Cancer Florida 0 [4, 0, 0]
2020-07-01 Cancer Florida 0 [0, 0, 0]
2020-08-01 Cancer Florida 0 [0, 0, 0]
2020-09-01 Cancer Florida 0 [0, 0, 0]
2020-10-01 Cancer Florida 0 [0, 0, 0]
2020-11-01 Cancer Florida 0 [0, 0, 0]
2020-12-01 Cancer Florida 0 [0, 0, 0]
2021-01-01 Cancer Florida 0 [0, 0, 0]
2021-02-01 Cancer Florida 0 [0, 0, 0]
2021-03-01 Cancer Florida 0 [0, 0, 0]
2021-04-01 Cancer Florida 0 [0, 0, 0]
2020-02-01 Covid Florida 0 [0, 0, 0]
2020-03-01 Covid Florida 6 [0, 0, 6]
2020-04-01 Covid Florida 4 [0, 6, 4]
2020-05-01 Covid Florida 0 [6, 4, 0]
2020-06-01 Covid Florida 0 [4, 0, 0]
2020-07-01 Covid Florida 0 [0, 0, 0]
2020-08-01 Covid Florida 0 [0, 0, 0]
2020-09-01 Covid Florida 0 [0, 0, 0]
2020-10-01 Covid Florida 0 [0, 0, 0]
2020-11-01 Covid Florida 0 [0, 0, 0]
2020-12-01 Covid Florida 0 [0, 0, 0]
2021-01-01 Covid Florida 0 [0, 0, 0]
2021-02-01 Covid Florida 0 [0, 0, 0]
2021-03-01 Covid Florida 0 [0, 0, 0]
2021-04-01 Covid Florida 0 [0, 0, 0]
2020-02-01 Covid Texas 0 [0, 0, 0]
2020-03-01 Covid Texas 2 [0, 0, 2]
2020-04-01 Covid Texas 3 [0, 2, 3]
2020-05-01 Covid Texas 4 [2, 3, 4]
2020-06-01 Covid Texas 0 [3, 4, 0]
2020-07-01 Covid Texas 0 [4, 0, 0]
2020-08-01 Covid Texas 3 [0, 0, 3]
2020-09-01 Covid Texas 0 [0, 3, 0]
2020-10-01 Covid Texas 0 [3, 0, 0]
2020-11-01 Covid Texas 0 [0, 0, 0]
2020-12-01 Covid Texas 0 [0, 0, 0]
2021-01-01 Covid Texas 0 [0, 0, 0]
2021-02-01 Covid Texas 0 [0, 0, 0]
2021-03-01 Covid Texas 0 [0, 0, 0]
2021-04-01 Covid Texas 0 [0, 0, 0]
2020-02-01 Flu Florida 0 [0, 0, 0]
2020-03-01 Flu Florida 5 [0, 0, 5]
2020-04-01 Flu Florida 0 [0, 5, 0]
2020-05-01 Flu Florida 0 [5, 0, 0]
2020-06-01 Flu Florida 0 [0, 0, 0]
2020-07-01 Flu Florida 0 [0, 0, 0]
2020-08-01 Flu Florida 0 [0, 0, 0]
2020-09-01 Flu Florida 0 [0, 0, 0]
2020-10-01 Flu Florida 0 [0, 0, 0]
2020-11-01 Flu Florida 0 [0, 0, 0]
2020-12-01 Flu Florida 0 [0, 0, 0]
2021-01-01 Flu Florida 0 [0, 0, 0]
2021-02-01 Flu Florida 0 [0, 0, 0]
2021-03-01 Flu Florida 0 [0, 0, 0]
2021-04-01 Flu Florida 0 [0, 0, 0]
Pygirl
  • 12,969
  • 5
  • 30
  • 43
  • 1
    Hi @Pygirl, thank you again for helping with detailed explanation. Finding the TypeError: Passing PeriodDtype data is invalid. Use `data.to_timestamp()` instead – Roy May 09 '21 at 10:40
  • @Roy: refer to this: https://stackoverflow.com/questions/59316865/typeerror-passing-perioddtype-data-is-invalid-use-data-to-timestamp-instea – Pygirl May 09 '21 at 14:37
0

You can use pandas.date_range() to generate a list of dates between 2020 Feb to 2021 April.

dates = pd.date_range('2020-02', '2021-04', freq='MS').strftime('%Y-%m')

Then groupby by Disease and State column and fill the missing part in each group.

def fill_missing(group):
    group = group.merge(pd.DataFrame({'Month': dates}), how='right')
    group[['Disease', 'State']] = group[['Disease', 'State']].ffill().bfill()
    group['Value'] = group['Value'].fillna(0)

    group['ValueList'] = [[a, b, c] for a, b, c in zip(group['Value'].astype(int), group['Value'].shift(-1).fillna(0).astype(int), group['Value'].shift(-2).fillna(0).astype(int))]

    return group

df_ = df.groupby(['Disease', 'State']).apply(fill_missing).reset_index(drop=True)
print(df_)

   Disease    State    Month  Value  ValueList
0   Cancer  Florida  2020-02    0.0  [0, 0, 4]
1   Cancer  Florida  2020-03    0.0  [0, 4, 0]
2   Cancer  Florida  2020-04    4.0  [4, 0, 0]
3   Cancer  Florida  2020-05    0.0  [0, 0, 0]
4   Cancer  Florida  2020-06    0.0  [0, 0, 0]
5   Cancer  Florida  2020-07    0.0  [0, 0, 0]
6   Cancer  Florida  2020-08    0.0  [0, 0, 0]
7   Cancer  Florida  2020-09    0.0  [0, 0, 0]
8   Cancer  Florida  2020-10    0.0  [0, 0, 0]
9   Cancer  Florida  2020-11    0.0  [0, 0, 0]
10  Cancer  Florida  2020-12    0.0  [0, 0, 0]
11  Cancer  Florida  2021-01    0.0  [0, 0, 0]
12  Cancer  Florida  2021-02    0.0  [0, 0, 0]
13  Cancer  Florida  2021-03    0.0  [0, 0, 0]
14  Cancer  Florida  2021-04    0.0  [0, 0, 0]
15   Covid  Florida  2020-02    0.0  [0, 6, 4]
16   Covid  Florida  2020-03    6.0  [6, 4, 0]
17   Covid  Florida  2020-04    4.0  [4, 0, 0]
18   Covid  Florida  2020-05    0.0  [0, 0, 0]
19   Covid  Florida  2020-06    0.0  [0, 0, 0]
20   Covid  Florida  2020-07    0.0  [0, 0, 0]
21   Covid  Florida  2020-08    0.0  [0, 0, 0]
22   Covid  Florida  2020-09    0.0  [0, 0, 0]
23   Covid  Florida  2020-10    0.0  [0, 0, 0]
24   Covid  Florida  2020-11    0.0  [0, 0, 0]
25   Covid  Florida  2020-12    0.0  [0, 0, 0]
26   Covid  Florida  2021-01    0.0  [0, 0, 0]
27   Covid  Florida  2021-02    0.0  [0, 0, 0]
28   Covid  Florida  2021-03    0.0  [0, 0, 0]
29   Covid  Florida  2021-04    0.0  [0, 0, 0]
30   Covid    Texas  2020-02    0.0  [0, 2, 3]
31   Covid    Texas  2020-03    2.0  [2, 3, 4]
32   Covid    Texas  2020-04    3.0  [3, 4, 0]
33   Covid    Texas  2020-05    4.0  [4, 0, 0]
34   Covid    Texas  2020-06    0.0  [0, 0, 3]
35   Covid    Texas  2020-07    0.0  [0, 3, 0]
36   Covid    Texas  2020-08    3.0  [3, 0, 0]
37   Covid    Texas  2020-09    0.0  [0, 0, 0]
38   Covid    Texas  2020-10    0.0  [0, 0, 0]
39   Covid    Texas  2020-11    0.0  [0, 0, 0]
40   Covid    Texas  2020-12    0.0  [0, 0, 0]
41   Covid    Texas  2021-01    0.0  [0, 0, 0]
42   Covid    Texas  2021-02    0.0  [0, 0, 0]
43   Covid    Texas  2021-03    0.0  [0, 0, 0]
44   Covid    Texas  2021-04    0.0  [0, 0, 0]
45     Flu  Florida  2020-02    0.0  [0, 5, 0]
46     Flu  Florida  2020-03    5.0  [5, 0, 0]
47     Flu  Florida  2020-04    0.0  [0, 0, 0]
48     Flu  Florida  2020-05    0.0  [0, 0, 0]
49     Flu  Florida  2020-06    0.0  [0, 0, 0]
50     Flu  Florida  2020-07    0.0  [0, 0, 0]
51     Flu  Florida  2020-08    0.0  [0, 0, 0]
52     Flu  Florida  2020-09    0.0  [0, 0, 0]
53     Flu  Florida  2020-10    0.0  [0, 0, 0]
54     Flu  Florida  2020-11    0.0  [0, 0, 0]
55     Flu  Florida  2020-12    0.0  [0, 0, 0]
56     Flu  Florida  2021-01    0.0  [0, 0, 0]
57     Flu  Florida  2021-02    0.0  [0, 0, 0]
58     Flu  Florida  2021-03    0.0  [0, 0, 0]
59     Flu  Florida  2021-04    0.0  [0, 0, 0]
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
  • Hi @Ynjxsjmh. Thank you so much. The logic is really impressive. Here, I am finding ValueError: You are trying to merge on period[M] and object columns. If you wish to proceed you should use pd.concat – Roy May 09 '21 at 10:41
  • @Roy May be converting your `Month` column to string with `df['Month'] = df['Month'].astype(str)`. – Ynjxsjmh May 09 '21 at 12:03