2

Given a dataframe containing columns related to several months in a year, I need to perform a series of identical operations on per-month slices of this dataframe and output dataframes for each month period containing the original values along with columns containing the values returned from the operations performed on the slice in question.

Question is how to set up these slices (after which the operations can be performed on the dataframes) without having to define each dataframe related to a particular month (the solution needs to be flexible to accommodate for changing periods).

Input:

df_a_number


Index   21-Nov  21-Dec  22-Jan  22-Feb
John    2       3       1       5   
Anna    1       4       3       8

df_b_letter


Index   21-Nov  21-Dec  22-Jan  22-Feb
John    a       f       j       p   
Anna    b       b       w       g

Desired output:

df_dec


Index   21-Nov  21-Dec  21-Dec-Diff 21-Dec-Letter
John    2       3       1           f           
Anna    1       4       3           b

df_jan

Index   21-Dec  21-Jan  22-Jan-Diff 22-Jan-Letter
John    3       1       -2          j   
Anna    4       3       -1          w

And so forth for Feb...

In addition to performing the calculation in the -Diff column and the merge in the -Letter column, the df needs to be filtered also before outputting to file.

Any suggestions or ideas would be much appreciated.

Zaphod
  • 147
  • 7

2 Answers2

2

First use concat with difference by DataFrame.diff:

df = pd.concat([df_a_number, 
                df_a_number.diff(axis=1).iloc[:, 1:].add_suffix('-Diff'), 
                df_b_letter.iloc[:, 1:].add_suffix('-Letter')], axis=1)

print (df)
       21-Nov  21-Dec  22-Jan  22-Feb  21-Dec-Diff  22-Jan-Diff  22-Feb-Diff  \
Index                                                                          
John        2       3       1       5            1           -2            4   
Anna        1       4       3       8            3           -1            5   

      21-Dec-Letter 22-Jan-Letter 22-Feb-Letter  
Index                                            
John              f             j             p  
Anna              b             w             g  

Solution with append previous month for dictionary of DataFrames:

d = {b: df.filter(regex=rf'^{a}$|{b}') for a, b 
                                       in zip(df_a_number.columns, df_a_number.columns[1:])}
# print (d)


print (d['21-Dec'])
       21-Nov  21-Dec  21-Dec-Diff 21-Dec-Letter
Index                                           
John        2       3            1             f
Anna        1       4            3             b

print (d['22-Jan'])

       21-Dec  22-Jan  22-Jan-Diff 22-Jan-Letter
Index                                           
John        3       1           -2             j
Anna        4       3           -1             w

EDIT: Possible solution, but not recommended with globals:

for a, b in zip(df_a_number.columns, df_a_number.columns[1:]):
    globals()[f'df_{b}'.lower().replace('-','_')] = df.filter(regex=rf'^{a}$|{b}')


print (df_21_dec)
       21-Nov  21-Dec  21-Dec-Diff 21-Dec-Letter
Index                                           
John        2       3            1             f
Anna        1       4            3             b
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Opted for the dict comprehension. However, the "solution with append previous month for dictionary" will also create dfs (ie. keys) such as `d[21-Dec-Diff]`. How to most easily not set up those dfs when doing the dict comprehension? – Zaphod Mar 23 '23 at 10:16
  • @Zaphod - what is `print(df_a_number.columns)` in real data? What is difference with data in question? – jezrael Mar 23 '23 at 10:19
  • aye aye sir how are you doin?(sorry for spamming and out of context comment) – Anurag Dabas Mar 23 '23 at 12:49
  • @AnuragDabas - thank you for asking, busy a bit ;) – jezrael Mar 23 '23 at 12:50
  • 1
    @jezrael sounds good sir...it has been great to see you here :) – Anurag Dabas Mar 23 '23 at 12:53
  • 1
    @jezrael It was a mistake on my part zipping `df`instead of `df_a_number`. Fixed now! – Zaphod Mar 24 '23 at 07:55
1

Use concat combined with add_suffix and diff for the difference, then sort_index with a custom key:

out = (pd.concat(
       [df_a_number.diff(axis=1).iloc[:, 1:].add_suffix('-Diff'),
        df_b_letter.add_suffix('-Letter'), df_a_number],
                 axis=1)
         .sort_index(axis=1, kind='stable',
                     key=lambda x: pd.to_datetime(x.str.extract(r'(\d+-[^-]+)', expand=False), format='%d-%b'))
      )

Output:

       22-Jan-Diff 22-Jan-Letter  22-Jan  22-Feb-Diff 22-Feb-Letter  22-Feb 21-Nov-Letter  21-Nov  21-Dec-Diff 21-Dec-Letter  21-Dec
Index                                                                                                                               
John            -2             j       1            4             p       5             a       2            1             f       3
Anna            -1             w       3            5             g       8             b       1            3             b       4
as separate dataframes

You further need to slice the output, you can use a dictionary comprehension:

dfs = {c.split('-')[-1].lower(): out.loc[:, (m:=out.columns.to_series().str.startswith(c))|m.shift(-1)]
       for c in df_b_letter}

Output:

dfs['feb']

       22-Jan  22-Feb-Diff 22-Feb-Letter  22-Feb
Index                                           
John        1            4             p       5
Anna        3            5             g       8

dfs['dec']
       21-Nov  21-Dec-Diff 21-Dec-Letter  21-Dec
Index                                           
John        2            1             f       3
Anna        1            3             b       4

Note that this takes the next available month, and if you have several years this won't work as you don't have the year information. In this case I would recommend to use a MultiIndex with the year/month/category information.

mozway
  • 194,879
  • 13
  • 39
  • 75