Pandas subtract dataframes where column matches

Question

I have two dataframes and I want to subtract counter1 with counter2. Preferably with a added column like 'diff'.

This is my attempt so far:

import pandas as pd
import numpy as np


file = ('data.csv')

df = pd.read_csv(file)
df = df[['Release', 'Created Date', 'Finished Date']]


x = df.groupby(['Release', 'Created Date'])['Created Date'].size().to_frame('size1')
y = df.groupby(['Release', 'Finished Date'])['Finished Date'].count().to_frame('size2')

x['counter1'] = x.groupby('Release').size1.cumsum().to_frame().sort_values('Created Date')
y['counter2'] = y.groupby('Release').size2.cumsum().to_frame().sort_values('Finished Date')
print(x)
print(y)

Output for X:
                                            size1  counter1
Release                       Created Date                 
Sony                          2020-07-09        1         1
                              2020-07-14        1         2
Sega                          2020-06-30        1         1
                              2020-07-09        1         2
                              2020-07-13        1         3
                              2020-07-14        1         4
                              2020-07-15        2         6
                              2020-07-17        2         8
                              2020-07-21        1         9
Nintendo                      2020-06-29        1         1
                              2020-07-01        2         3
                              2020-07-06        1         4


Output for y:

                                             size2  counter2
Release                       Finished Date                 
Sony                          2020-07-17         1         1
                              2020-07-20         1         2
Sony                          2020-07-03         1         1
                              2020-07-13         1         2
                              2020-07-17         1         3
                              2020-07-20         1         4
                              2020-07-23         3         7
                              2020-07-24         1         8
                              2020-07-28         1         9
Nintendo                      2020-07-09         1         1
                              2020-07-10         1         2
                              2020-07-15         1         3

This is my attempt, but the result is very confusing and certainly not correct:

t = x['counter1'] - y['counter2']

Had to remove output because of code compared to text was too much, but the output was very strange anyway.

EDIT.

print(df)

output:

Release        Created Date Finished Date
0   Sony       2020-07-21    2020-07-23
1   Sony       2020-07-17    2020-07-28
2   Sony       2020-07-17    2020-07-23
3   Sony       2020-07-15    2020-07-17
4   Sony       2020-07-15    2020-07-24
..                            ...          ...           ...
76  Sony       2020-06-02    2020-06-04
77  Sega       2020-06-01    2020-06-12
79  Sega       2020-06-01    2020-07-22
80  Sony       2020-06-01    2020-06-16
81  Nintendo   2020-06-01    2020-07-16

The goal is to make a dataset for a timeline chart, in which the dates are on the x-axis and when a Release is created it should go up on the y-axis, and then when it's finished it should go down on the y-axis.

Maybe I'm making it too complicated.

Update:

The help I got from a fellow community member made me accomplish the goal, much appreciated. Now I want to build upon that and create a multi timeline chart, with multiple releases shown in the same graph.

Here is the working solution for a single timeline chart.


// This is how I managed to get it working for a single release, but this will eventually become a problem later when I want all the releases.
df = df[df['Release'].str.contains("Sony")]


deposits = pd.Series(df.groupby('Created').size())
withdrawals = pd.Series(df.groupby('Finished').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})

balance = balance.assign(active=balance.net_movements.cumsum())
balance = balance.rename(columns={"active": "Sony"})

print(balance)

Output:

            net_movements  Sony
2020-06-01            3.0   3.0
2020-06-02            2.0   5.0
2020-06-03            2.0   7.0
2020-06-04           -1.0   6.0
2020-06-05            0.0   6.0
2020-06-08            1.0   7.0

We can remove the net_movements and make the final format complete:

balance = balance.drop(['net_movements'], axis=1)
print(balance)

             Sony
2020-06-01   3.0
2020-06-02   5.0
2020-06-03   7.0
2020-06-04   6.0
2020-06-05   6.0
2020-06-08   7.0

This solved my problem with displaying a single release. Now I want to build upon this and display all releases in the same graph.

Here is my attempt:


deposits = pd.Series(df.groupby(['Release', 'Created']).size())
print(deposits)

output: (shortened down)

Release                        Created   
Sega                           2020-06-01    1
                               2020-06-04    1
                               2020-07-14    1
Nintendo                       2020-06-01    3
                               2020-06-02    2
                               2020-06-03    2

withdrawals = pd.Series(df.groupby(['Release', 'Finished']).size())
print(withdrawals)

Release                        Finished  
Sony                           2020-06-12    1
                               2020-06-16    2
                               2020-06-18    1
Nintendo                       2020-06-04    1
                               2020-06-05    1
                               2020-06-16    2

Now, here is where it gets complicated. Not only does the columns appear everywhere, but the active column doesn't reset when it reaches a new release, it keep ticking over the releases.

balance = balance.assign(active=balance.net_movements.cumsum())
print(balance)

Output:

                                                         net_movements  active
Release                       Created    Finished                         
Sony                          2020-06-01 2020-06-12              1       1
                                         2020-06-16              2       3
                                         2020-06-18              0       3
Nintendo                      2020-06-04 2020-06-12             -1       2
                                         2020-06-16              1       3
                                         2020-06-18              0       3

Wanted format (with dummy values) :

             Sony     Nintendo
2020-06-01   3.0           4.0
2020-06-02   5.0           5.0
2020-06-03   7.0           2.0
2020-06-04   6.0           4.0
2020-06-05   6.0           4.0
2020-06-08   7.0           7.0

It's hard asking the correct question with as little information as possible, but at the same time not to little, this one got a bit long, but hopefully I explained my goal and problem reasonably well.

can you provide sample input data i.e. `df` after reading in the file, which we cannot see? — David Erickson, Jul 28 '20 at 22:38
take a look at this question that shows how to provide sample data... you might get better answers this way https://stackoverflow.com/q/20109391/6692898 — RichieV, Jul 29 '20 at 01:12

score 2 · Accepted Answer · answered Jul 29 '20 at 01:10

2

Basically you are looking for the "active releases count" at any given point in time. I would start by creating the timeline without any data and treat Created and Finished dates as deposits/withdrawals on a balance account.

timeline = pd.DateRange(df.Created.min(), df.Finished.max(), freq='D')
deposits = pd.Series(df.groupby('Created Date').size())
withdrawals = pd.Series(df.groupby('Finished Date').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})
balance = balance.reindex(timeline, fill_value=0)
balance = balance.assign(active=balance.net_movements.cumsum())

answered Jul 29 '20 at 01:10

RichieV

5,103
2
11
24

Exactly what I was looking for. This solved my problem in a very nice fashion. I'm new to SO but when I get reputation > 15 , I will upvote your answer! Thanks. – klabbaparn Jul 29 '20 at 10:24
HTH, and if you're having trouble with this kinds of analysis try writing pseudocode or doing it manually with a sample of your data, then replicate that with actual code. Pandas has great ideas in its well written user guide https://pandas.pydata.org/docs/user_guide/index.html – RichieV Jul 29 '20 at 13:29
Also if you want to accept this as an answer read https://stackoverflow.com/help/someone-answers – RichieV Jul 29 '20 at 13:31
The user guide with all the examples was very good. I made an update on my post and listed your solution as an accepted answer. The graph became very nice, and I was able to see the trending from active and closed releases. Now I want to build upon this code and add multiple releases in the same graph, as a multi-series timeline chart. I've spent entire day trying to solve it, and right now I'm at a crossroad. If you have a spare minute please see if you can see what I'm trying to do :) thanks again. – klabbaparn Jul 29 '20 at 19:22
I suggest you generalize what you are trying to do (try to write it as a book problem) and look for a solution here or in google... if you can't find anything then post another question, there are thousands of people that can help you here, but you have to make it easy for us to do it – RichieV Jul 29 '20 at 20:49
Yes, I should probably write a new question that starts from where I am now. I want to do the exactly same thing as I've done, but have to group them by releases instead. Like right now it tracks one persons deposit/withdrawals, but what I want to do is not to limit it to one person, but instead n-numbers of people. In the example you provided, the "active" column should instead be n-numbers of people and track their activity simultaneously. Gonna sleep on it and formulate a new question. – klabbaparn Jul 29 '20 at 21:44
New question related to this: https://stackoverflow.com/questions/63170686/pandas-deposits-and-withdrawals-over-a-time-period-with-n-number-of-people – klabbaparn Jul 30 '20 at 10:30

Pandas subtract dataframes where column matches

1 Answers1