0

I have the following df:

df = pd.DataFrame({'block': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                   'C': [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000],
                   })

Which outputs:

  block  B     C
0     1  a  1000
1     1  b  2000
2     1  c  3000
3     2  a  4000
4     2  b  5000
5     2  c  6000
6     3  a  7000
7     3  b  8000
8     3  c  9000

I want to output a diff_column (could also be an entirely new_df) who's values are the difference between the values in the C column grouped by the B column for every different value in the block column.

That is, what's the difference between a's value in block 2 and a's value in block 1?

Example for a:

4000 - 1000 = 3000

Note, the rows for block == 1 would be empty since there is no other previous block.

I have tried:

df['diff_column'] = df.groupby(['block', 'B'])['C'] - df.shift[-1].groupby(['block', 'B'])['C']

with no success.

Luiz Scheuer
  • 305
  • 1
  • 10

2 Answers2

2

Assuming the blocks are sorted, you can use groupby.diff with column B as grouper:

df['diff'] = df.groupby('B')['C'].diff()

With groupby.shift as you initially tried to do:

df['diff'] = df['C']-df.groupby('B')['C'].shift()

output:

   block  B     C    diff
0      1  a  1000     NaN
1      1  b  2000     NaN
2      1  c  3000     NaN
3      2  a  4000  3000.0  # block 2/a - block 1/a
4      2  b  5000  3000.0
5      2  c  6000  3000.0
6      3  a  7000  3000.0
7      3  b  8000  3000.0
8      3  c  9000  3000.0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thanks! Why is there no need to specify column `block` anywhere? I'm assuming this has to do with your assumption that `block` is indeed sorted. Blocks will always increase from top to bottom. – Luiz Scheuer Oct 11 '22 at 09:45
  • Because if you group by block, you won't have values to compare (all your groups will be 1 item). Try to run `list(df.groupby('B'))` to see the groups. The `block` is implicitly there due to the order of the rows ;) – mozway Oct 11 '22 at 09:47
  • 2
    `diff()` is the even better solution. Thanks for reminding mozway :-) – Marco_CH Oct 11 '22 at 09:48
  • Thanks guys! What if I have even more columns I want to group by, instead of just column B? Will it calculate the difference between those groupings between the blocks too? – Luiz Scheuer Oct 11 '22 at 09:54
2

You could simply do:

# group by 'B' and use shift(1) to get previous value, then subtract from 'C' only
df["diff_column"] = df["C"] - df.groupby(["B"])["C"].shift(1)

+----+---------+-----+------+---------------+
|    |   block | B   |    C |   diff_column |
|----+---------+-----+------+---------------|
|  0 |       1 | a   | 1000 |           nan |
|  1 |       1 | b   | 2000 |           nan |
|  2 |       1 | c   | 3000 |           nan |
|  3 |       2 | a   | 4000 |          3000 |
|  4 |       2 | b   | 5000 |          3000 |
|  5 |       2 | c   | 6000 |          3000 |
|  6 |       3 | a   | 7000 |          3000 |
|  7 |       3 | b   | 8000 |          3000 |
|  8 |       3 | c   | 9000 |          3000 |
+----+---------+-----+------+---------------+
Marco_CH
  • 3,243
  • 8
  • 25