
I'm trying to dynamically build a format that shows the number of deposits compared to withdrawals in a timeline chart. Whenever a deposit is made the graph goes up, and whenever a withdrawal is made it goes down.

This is how far I've gotten:

df.head()

name    Deposits    Withdrawals

Peter   2019-03-07  2019-03-11
Peter   2019-03-08  2019-03-19
Peter   2019-03-12  2019-05-22
Peter   2019-03-12  2019-10-31
Peter   2019-03-14  2019-04-05

Here is the data manipulation that shows the net movements for one person, Peter.

x = pd.Series(df.groupby('Deposits').size())
y = pd.Series(df.groupby('Withdrawals').size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(Peter=balance.net_mov.cumsum())

print(balance)

            net_mov  Peter
2019-03-07        1      1
2019-03-08        1      2
2019-03-11       -1      1
2019-03-12        2      3
2019-03-14        1      4
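For reference, the snippet above runs end-to-end like this (the dataframe is rebuilt from the five rows shown in `df.head()`, so the balance here covers all nine distinct dates, not just the five rows printed above):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Peter'] * 5,
    'Deposits': ['2019-03-07', '2019-03-08', '2019-03-12',
                 '2019-03-12', '2019-03-14'],
    'Withdrawals': ['2019-03-11', '2019-03-19', '2019-05-22',
                    '2019-10-31', '2019-04-05'],
})

# count movements per date; sub() aligns the two date indexes,
# filling dates that appear on only one side with 0
x = pd.Series(df.groupby('Deposits').size())
y = pd.Series(df.groupby('Withdrawals').size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(Peter=balance.net_mov.cumsum())
print(balance)
```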

This works perfectly fine, and this is the format I want. Now let's say I want to extend this beyond Peter's deposits and withdrawals and add an arbitrary number of people. Let's assume my dataframe looks like this:


df2.head()

name    Deposits    Withdrawals

Peter   2019-03-07  2019-03-11
Anna    2019-03-08  2019-03-19
Anna    2019-03-12  2019-05-22
Peter   2019-03-12  2019-10-31
Simon   2019-03-14  2019-04-05

The format I'm aiming for is shown below. I don't know how to group everything, and I don't know beforehand which names or how many columns there will be, so I can't hardcode them; the columns have to be generated dynamically.

            net_mov1  Peter   net_mov2   Anna    net_mov3  Simon   
2019-03-07        1      1           1      1           2      2
2019-03-08        1      2           2      3          -1      1
2019-03-11       -1      1           0      3           2      3
2019-03-12        2      3          -2      1           4      7
2019-03-14        1      4           3      4          -1      6

UPDATE:

First off, thanks for the help. I'm getting closer to my goal. This is the progress:

x = pd.Series(df.groupby(['Created', 'name']).size())
y = pd.Series(df.groupby(['Finished', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())

balance_byname = balance.groupby('name')
balance_byname.get_group("Peter")

Output:

                              net_mov  balance
name  Created    Finished                     
Peter 2017-07-03 2017-07-06         1        1
                 2017-07-10         1        2
                 2017-07-13         0        2
                 2017-07-14         1        3
...                               ...      ...
      2020-07-29 2020-07-15         0     4581
                 2020-07-17         0     4581
                 2020-07-20         0     4581
                 2020-07-21        -1     4580

[399750 rows x 2 columns]

This is of course far too many rows; the dataset I'm working with only has around 2,500 rows.

I've tried to unstack it, but that creates problems of its own.


2 Answers


Given df:

name    Deposits    Withdrawals
Peter   2019-03-07  2019-03-11
Anna    2019-03-08  2019-03-19
Anna    2019-03-12  2019-05-22
Peter   2019-03-12  2019-10-31
Simon   2019-03-14  2019-04-05

You can melt the dataframe, mark deposits as 1 and withdrawals as -1, and then pivot:

df = pd.DataFrame(
    {'name': {0: 'Peter', 1: 'Anna', 2: 'Anna', 3: 'Peter', 4: 'Simon'},
     'Deposits': {0: '2019-03-07', 1: '2019-03-08', 2: '2019-03-12',
                  3: '2019-03-12', 4: '2019-03-14'},
     'Withdrawals': {0: '2019-03-11', 1: '2019-03-19', 2: '2019-05-22',
                     3: '2019-10-31', 4: '2019-04-05'}})

# a plain pivot('value', 'name', 'variable') raises "Index contains duplicate
# entries" when the same (date, name) pair occurs more than once, so use
# pivot_table with a sum aggregate instead
df2 = (df.melt('name')
         .assign(variable=lambda x: x.variable.map({'Deposits': 1, 'Withdrawals': -1}))
         .pivot_table('variable', 'value', 'name', aggfunc='sum')
         .fillna(0)
         .rename(columns=lambda c: f'{c} netmov'))

The above gives the net change of the balance:

name        Anna netmov  Peter netmov  Simon netmov
value                                              
2019-03-07          0.0           1.0           0.0
2019-03-08          1.0           0.0           0.0
2019-03-11          0.0          -1.0           0.0
2019-03-12          1.0           1.0           0.0
2019-03-14          0.0           0.0           1.0
2019-03-19         -1.0           0.0           0.0
2019-04-05          0.0           0.0          -1.0
2019-05-22         -1.0           0.0           0.0
2019-10-31          0.0          -1.0           0.0

Finally, calculate the balance with a cumulative sum and concatenate it with the previously computed net changes:

df2 = (pd.concat([df2, df2.cumsum().rename(columns=lambda c: c.split()[0] + ' balance')],
                 axis=1)
         .sort_index(axis=1))

result:

name        Anna balance  Anna netmov  ...  Simon balance  Simon netmov
value                                  ...                             
2019-03-07           0.0          0.0  ...            0.0           0.0
2019-03-08           1.0          1.0  ...            0.0           0.0
2019-03-11           1.0          0.0  ...            0.0           0.0
2019-03-12           2.0          1.0  ...            0.0           0.0
2019-03-14           2.0          0.0  ...            1.0           1.0
2019-03-19           1.0         -1.0  ...            1.0           0.0
2019-04-05           1.0          0.0  ...            0.0          -1.0
2019-05-22           0.0         -1.0  ...            0.0           0.0
2019-10-31           0.0          0.0  ...            0.0           0.0

[9 rows x 6 columns]
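If you only need one person's pair of columns from the wide result, `DataFrame.filter` selects columns by substring. A small sketch on hand-built columns (not the full pipeline above):

```python
import pandas as pd

wide = pd.DataFrame(
    {'Anna balance': [0.0, 1.0], 'Anna netmov': [0.0, 1.0],
     'Peter balance': [1.0, 1.0], 'Peter netmov': [1.0, 0.0]},
    index=['2019-03-07', '2019-03-08'],
)

# keep every column whose name contains 'Anna'
anna = wide.filter(like='Anna')
print(anna.columns.tolist())  # ['Anna balance', 'Anna netmov']
```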
  • This looks exactly like the format I'm looking for. The problem is that it's too complex for me to comprehend what is happening. I've tried to reproduce it and I'm getting an error on this command ---> .pivot('value','name','variable').fillna(0) ValueError: Index contains duplicate entries, cannot reshape – klabbaparn Jul 31 '20 at 13:48
  • Did you use exactly the same data as in the example `df`? – ipj Jul 31 '20 at 13:52
  • I've added a reproducible definition of `df` (in dictionary form) to the answer. Run the code exactly as provided. Do you see any errors now? – ipj Jul 31 '20 at 14:02
  • You're correct, the code works perfectly fine. The error I'm getting was with the "real" data I'm working with. – klabbaparn Jul 31 '20 at 14:05
  • Nice to hear it! If it's OK, please consider accepting it as an answer. – ipj Jul 31 '20 at 14:48
  • Done and done! Do you have any idea why I'm getting the error with the real data? There is much more of it, but the format is pretty much the same. It does, however, contain some NaN values that might cause the problem. – klabbaparn Jul 31 '20 at 15:04
  • No idea. You'll have to find the fragment of data that causes the error. Maybe there are duplicates in the data. – ipj Jul 31 '20 at 15:06
  • There are many duplicates in all of the columns: name, Deposits and Withdrawals :/ – klabbaparn Jul 31 '20 at 15:09
  • I've improved my solution to handle duplicates in the data. Instead of `pivot` I now use `pivot_table` with the `sum` aggregate. See if that works for you. – ipj Aug 03 '20 at 09:07
  • Yes, it works even with the huge data sample that I had. – klabbaparn Aug 03 '20 at 13:55

Try making use of a pandas MultiIndex. This is almost the same code as in your question, but with two changes:

  • the `name` column is included in the groupby argument
  • a `.groupby('name')` call is added in the last line

With the code:

x = pd.Series(df.groupby(['Deposits', 'name']).size())
y = pd.Series(df.groupby(['Withdrawals', 'name']).size())
balance = pd.DataFrame({'net_mov': x.sub(y, fill_value=0)})
balance = balance.assign(balance=balance.groupby('name').net_mov.cumsum())

The groupby in the last line tells pandas to treat each name as a separate dataframe before applying cumsum, so the running balance is kept separate for each account.
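As a minimal illustration of what the per-group cumsum does (toy numbers, not your data):

```python
import pandas as pd

moves = pd.DataFrame({
    'name':    ['Peter', 'Anna', 'Peter', 'Anna'],
    'net_mov': [1, 1, -1, 2],
})

# without the groupby, cumsum would mix Peter's and Anna's movements;
# with it, each name keeps its own running balance
moves['balance'] = moves.groupby('name')['net_mov'].cumsum()
print(moves['balance'].tolist())  # [1, 1, 0, 3]
```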


Now you can keep it in this shape, with only two columns and the name as a second level in the row MultiIndex. You can create a groupby object by calling

balance_byname = balance.groupby('name') # notice there is no aggregation nor transformation

To be used whenever you need to access only one account with .get_group() https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.get_group.html#pandas.core.groupby.GroupBy.get_group

Or:

You can also add a new line at the end:

balance = balance.unstack('name')

which will give a shape similar to your expected output. This will, however, possibly create many NaN values, since it indexes all dates for all names. That can drastically increase memory usage if there are many dates and many names, with each name having movements on only a few dates.
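A sketch of that unstack on a tiny balance series (made-up values; the forward-fill afterwards is one way to turn the NaN gaps back into carried-over balances):

```python
import pandas as pd

balance = pd.Series(
    [1, 1, 0],  # running balance per (date, name)
    index=pd.MultiIndex.from_tuples(
        [('2019-03-07', 'Peter'),
         ('2019-03-08', 'Anna'),
         ('2019-03-11', 'Peter')],
        names=['date', 'name'],
    ),
)

wide = balance.unstack('name')
# NaN marks dates where a name had no movement; carry the last known
# balance forward, and treat dates before the first movement as 0
wide = wide.ffill().fillna(0)
print(wide)
```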

  • Pandas is really powerful, there seems to be a function for everything. Your suggestion got me closer to the goal; I just have some stack/unstack problems to sort out now. – klabbaparn Jul 31 '20 at 13:40