0

I have the below dataset. How can create a new column that shows the difference of money for each person, for each expiry?

The column is yellow is what I want. You can see that it is the difference in money for each expiry point for the person. I highlighted the other rows in colors so it is more clear.

Thanks a lot.

Example

[enter image description here]

ZspreDo
  • 1
  • 1
  • 2
    Hi , Welcome to SO..!! Please [do not paste images of your data](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question) , instead create a dummy example which can be reproduced by code. You can refer: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – anky Sep 03 '20 at 16:24
  • 1
    Sorry, when I put the image in the image editor it just pasted the link – ZspreDo Sep 03 '20 at 16:25
  • 1
    you can get your sample with `df.head(10).to_dict('list')`... I see the first row has diff `50`, and assume it is because the code should always subtract the previous date for that `Name-Expiry`, what if there is no data for `Mike-1Y` on date `8-29`, would you subtract whatever money he had on `8-28`? – RichieV Sep 03 '20 at 16:28
  • Thanks Richie. There will always be data for previous day as I am capturing last 7 days in the original dataframe. Edit: It is correct what you mentioned, it should always substract previous date for that – ZspreDo Sep 03 '20 at 16:47
  • @RichieV has suggested you the right way to get the data into a code fragment, kindly post the same so we can reproduce the data in our systems, thanks in advance. :) – anky Sep 03 '20 at 16:49
  • Hi anky, apologies, I cannot do that as the df relies in my company's virtual machine. What I've posted is pretty much it though – ZspreDo Sep 03 '20 at 16:53

2 Answers2

1
import pandas as pd
import numpy as np

example = pd.DataFrame( data = {'Day': ['2020-08-30', '2020-08-30','2020-08-30','2020-08-30',
                                        '2020-08-29', '2020-08-29','2020-08-29','2020-08-29'],
                                'Name': ['John', 'Mike', 'John', 'Mike','John', 'Mike', 'John', 'Mike'],
                                'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                                'Expiry': ['1Y', '1Y', '2Y','2Y','1Y','1Y','2Y','2Y']})

example_0830 = example[ example['Day']=='2020-08-30' ].reset_index()
example_0829 = example[ example['Day']=='2020-08-29' ].reset_index()

example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame( example_0829, columns = ['key','Money'])

example_0830 = pd.merge(example_0830, example_0829, on = 'key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y','index'])

Result:

          Day  Name  Money_x Expiry  Difference
0  2020-08-30  John      100     1Y          50
1  2020-08-30  Mike      950     1Y         900
2  2020-08-30  John      200     2Y         -50
3  2020-08-30  Mike     1000     2Y        -200

If the difference is just derived from the previous date, you can just define a date variable in the beginning to find today(t) and previous day (t-1) to filter out original dataframe.

  • there's probably an easier way to go about this, or you can do the above in a function to make a more scalable by defining the date as a variable. – angelsarebeautiful Sep 03 '20 at 16:59
0

You can solve it with groupby.diff

Take the dataframe

df = pd.DataFrame({
    'Day': [30, 30, 30, 30, 29, 29, 28, 28],
    'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
    'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
    'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)

Which looks like

   Day  Name  Money  Expiry
0   30  John    100       1
1   30  Mike    950       1
2   30  John    200       2
3   30  Mike   1000       2
4   29  John     50       1
5   29  Mike     50       1
6   28  John    250       2
7   28  Mike   1200       2

And the code

# make sure we have dates in the order we want
df.sort_values('Day', ascending=False)

# groubpy and get the difference from the next row in each group
# diff(1) calculates the difference from the previous row, so -1 will point to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)

Output

   Day  Name  Money  Expiry  Difference
0   30  John    100       1        50.0
1   30  Mike    950       1       900.0
2   30  John    200       2       -50.0
3   30  Mike   1000       2      -200.0
4   29  John     50       1         NaN
5   29  Mike     50       1         NaN
6   28  John    250       2         NaN
7   28  Mike   1200       2         NaN
RichieV
  • 5,103
  • 2
  • 11
  • 24