I have two dataframes and I want to subtract counter1 with counter2. Preferably with a added column like 'diff'.
This is my attempt so far:
import pandas as pd
import numpy as np
file = ('data.csv')
df = pd.read_csv(file)
df = df[['Release', 'Created Date', 'Finished Date']]
x = df.groupby(['Release', 'Created Date'])['Created Date'].size().to_frame('size1')
y = df.groupby(['Release', 'Finished Date'])['Finished Date'].count().to_frame('size2')
x['counter1'] = x.groupby('Release').size1.cumsum().to_frame().sort_values('Created Date')
y['counter2'] = y.groupby('Release').size2.cumsum().to_frame().sort_values('Finished Date')
print(x)
print(y)
Output for X:
size1 counter1
Release Created Date
Sony 2020-07-09 1 1
2020-07-14 1 2
Sega 2020-06-30 1 1
2020-07-09 1 2
2020-07-13 1 3
2020-07-14 1 4
2020-07-15 2 6
2020-07-17 2 8
2020-07-21 1 9
Nintendo 2020-06-29 1 1
2020-07-01 2 3
2020-07-06 1 4
Output for y:
size2 counter2
Release Finished Date
Sony 2020-07-17 1 1
2020-07-20 1 2
Sony 2020-07-03 1 1
2020-07-13 1 2
2020-07-17 1 3
2020-07-20 1 4
2020-07-23 3 7
2020-07-24 1 8
2020-07-28 1 9
Nintendo 2020-07-09 1 1
2020-07-10 1 2
2020-07-15 1 3
This is my attempt, but the result is very confusing and certainly not correct:
t = x['counter1'] - y['counter2']
Had to remove output because of code compared to text was too much, but the output was very strange anyway.
EDIT.
print(df)
output:
Release Created Date Finished Date
0 Sony 2020-07-21 2020-07-23
1 Sony 2020-07-17 2020-07-28
2 Sony 2020-07-17 2020-07-23
3 Sony 2020-07-15 2020-07-17
4 Sony 2020-07-15 2020-07-24
.. ... ... ...
76 Sony 2020-06-02 2020-06-04
77 Sega 2020-06-01 2020-06-12
79 Sega 2020-06-01 2020-07-22
80 Sony 2020-06-01 2020-06-16
81 Nintendo 2020-06-01 2020-07-16
The goal is to make a dataset for a timeline chart, in which the dates are on the x-axis and when a Release is created it should go up on the y-axis, and then when it's finished it should go down on the y-axis.
Maybe I'm making it too complicated.
Update:
The help I got from a fellow community member made me accomplish the goal, much appreciated. Now I want to build upon that and create a multi timeline chart, with multiple releases shown in the same graph.
Here is the working solution for a single timeline chart.
// This is how I managed to get it working for a single release, but this will eventually become a problem later when I want all the releases.
df = df[df['Release'].str.contains("Sony")]
deposits = pd.Series(df.groupby('Created').size())
withdrawals = pd.Series(df.groupby('Finished').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})
balance = balance.assign(active=balance.net_movements.cumsum())
balance = balance.rename(columns={"active": "Sony"})
print(balance)
Output:
net_movements Sony
2020-06-01 3.0 3.0
2020-06-02 2.0 5.0
2020-06-03 2.0 7.0
2020-06-04 -1.0 6.0
2020-06-05 0.0 6.0
2020-06-08 1.0 7.0
We can remove the net_movements and make the final format complete:
balance = balance.drop(['net_movements'], axis=1)
print(balance)
Sony
2020-06-01 3.0
2020-06-02 5.0
2020-06-03 7.0
2020-06-04 6.0
2020-06-05 6.0
2020-06-08 7.0
This solved my problem with displaying a single release. Now I want to build upon this and display all releases in the same graph.
Here is my attempt:
deposits = pd.Series(df.groupby(['Release', 'Created']).size())
print(deposits)
output: (shortened down)
Release Created
Sega 2020-06-01 1
2020-06-04 1
2020-07-14 1
Nintendo 2020-06-01 3
2020-06-02 2
2020-06-03 2
withdrawals = pd.Series(df.groupby(['Release', 'Finished']).size())
print(withdrawals)
Release Finished
Sony 2020-06-12 1
2020-06-16 2
2020-06-18 1
Nintendo 2020-06-04 1
2020-06-05 1
2020-06-16 2
Now, here is where it gets complicated. Not only does the columns appear everywhere, but the active column doesn't reset when it reaches a new release, it keep ticking over the releases.
balance = balance.assign(active=balance.net_movements.cumsum())
print(balance)
Output:
net_movements active
Release Created Finished
Sony 2020-06-01 2020-06-12 1 1
2020-06-16 2 3
2020-06-18 0 3
Nintendo 2020-06-04 2020-06-12 -1 2
2020-06-16 1 3
2020-06-18 0 3
Wanted format (with dummy values) :
Sony Nintendo
2020-06-01 3.0 4.0
2020-06-02 5.0 5.0
2020-06-03 7.0 2.0
2020-06-04 6.0 4.0
2020-06-05 6.0 4.0
2020-06-08 7.0 7.0
It's hard asking the correct question with as little information as possible, but at the same time not to little, this one got a bit long, but hopefully I explained my goal and problem reasonably well.