
I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:

Date1     Date2     Date3     Date4
1         1         2         3
N/A       1         2         2

The first row's first instance is Date1. The second row's first instance is Date2.

The new table would be formatted such that the columns are now the months relative to the first action and would look like:

FirstMonth     SecondMonth     ThirdMonth
1              1               2
1              2               2

Creating the initial pivot table is straightforward in pandas; I'm curious if there are any suggestions for how to develop the table of relative starting points. Thank you!
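For reference, the example above can be reproduced as a small DataFrame, assuming the N/A cell is stored as a missing value (NaN):

import numpy as np
import pandas as pd

# Example pivot output from the tables above; N/A is represented as NaN
df = pd.DataFrame(
    [[1, 1, 2, 3],
     [np.nan, 1, 2, 2]],
    columns=["Date1", "Date2", "Date3", "Date4"],
)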

adam.ziolk
  • `df.diff(axis=1)`? – Quang Hoang Oct 29 '19 at 17:39
  • So going from that first table to your second is the question? – ALollz Oct 29 '19 at 17:42
  • @ALollz, correct. I currently have the first table and I'm looking to transform it into the second one. – adam.ziolk Oct 29 '19 at 18:05
  • I think anything I come up with would be basically a copy of https://stackoverflow.com/questions/47897166/left-justify-string-values-in-a-pandas-dataframe. applying `sorted` is a concise, but slow solution. Otherwise you can use the `justify` solution, something like: `pd.DataFrame(justify(df.to_numpy(), invalid_val = 'N/A'))` – ALollz Oct 29 '19 at 18:16
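For completeness, here is a minimal sketch of the justify idea from the linked answer; justify_left below is a simplified stand-in for the `justify` helper referenced in the last comment, and it assumes the N/A cells are stored as NaN:

import numpy as np
import pandas as pd

def justify_left(values, invalid_val=np.nan):
    # Shift the valid (non-NaN) entries of each row to the left,
    # filling the remaining slots with invalid_val.
    mask = ~np.isnan(values)
    counts = mask.sum(axis=1)
    out = np.full(values.shape, invalid_val, dtype=float)
    # Boolean mask selecting the first counts[i] positions of row i;
    # both sides of the assignment enumerate cells in row-major order,
    # so the left-to-right order within each row is preserved.
    target = np.arange(values.shape[1]) < counts[:, None]
    out[target] = values[mask]
    return out

# df is the pivot from the question, with N/A stored as NaN
relative = pd.DataFrame(justify_left(df.to_numpy(dtype=float)), index=df.index)

Being NumPy-based, this avoids a Python-level loop over rows.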

2 Answers


First, make sure your dataframe columns are actual datetime values. Then you can run the following to sum the actions for each date, group those sums by month, and compute the monthly totals:

>>> df

2019-01-01  2019-01-02  2019-02-01
Row
0             4          22          40
1            22          67          86
2            72          27          25
3             0          26          60
4            44          62          32
5            73          86          81
6            81          17          58
7            88          29          21
>>> df.sum().groupby(df.sum().index.month).sum()

1    720
2    403

And if you want it to reflect what you had above:

>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x),'%m'),'%B') for x in out.columns]
>>> out
   January  February
0      720      403

And if I misunderstood you, and you want it broken out by record / row:

>>> df.T.groupby(df.T.index.month).sum().T
1   2
Row
0     26  40
1     89  86
2     99  25
3     26  60
4    106  32
5    159  81
6     98  58
7    117  21

Rename the columns as above.
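For instance, reusing the same strftime conversion on the per-row result (out is just an illustrative name):

>>> out = df.T.groupby(df.T.index.month).sum().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(m),'%m'),'%B') for m in out.columns]

out now has January and February as column labels instead of 1 and 2.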

Adam Zeldin
  • So, I'm actually trying to keep the original data unaggregated. The end state is a table where each column is the first/second/third month, so I can look at trends across each account over the same relative time period (e.g. between month 1 and month 2, accounts increase activity; from month 2 to month 3 they decrease activity, etc.). – adam.ziolk Oct 29 '19 at 18:08
  • Did you see the update at the bottom? I believe it accomplishes what you’re looking for, gives the monthly totals by row. `df.T.groupby(df.T.index.month).sum().T` – Adam Zeldin Oct 29 '19 at 18:29

The trick is to use .apply() combined with .dropna():

df.T.apply(lambda x: pd.Series(x.dropna().values)).T
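Applied to the example frame from the question (with N/A stored as NaN), this shifts each row's non-missing values to the left; the column labels assigned below are only illustrative:

relative = df.T.apply(lambda x: pd.Series(x.dropna().values)).T
relative.columns = [f"Month{i + 1}" for i in relative.columns]  # e.g. Month1, Month2, ...
# Row 0 becomes 1, 1, 2, 3 and row 1 becomes 1, 2, 2, NaN
# (values come back as floats because of the missing cell).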
adam.ziolk