2

Description

How do you use Pandas groupby to group certain columns, but not others?

Current Progress

table_D = pd.DataFrame({
    'Geo_ID': [1, 1, 1, 1, 2, 3, 4, 4, 5],
    'A_Code': [12, 12, 12, 65, 65, 65, 65, 98, 98],
    'A_Cost': [2, 9, 1, 10, 6, 7, 7, 6, 2],
}, columns=['Geo_ID', 'A_Code', 'A_Cost'])
table_D_dummies = pd.get_dummies(data = table_D, columns = ["A_Code"])
table_D_dummies_grouped = table_D_dummies.groupby(by = ["Geo_ID"]).sum()

Problem

As shown below, this correctly sums cost by Geo_ID. Unfortunately, it's also summing by A_Code.

A_Code_12, A_Code_65 and A_Code_98 should combine separately. Additionally, in the actual dataset, there are over 100 A_Codes.

Data

table_D

+--------+--------+--------+
| Geo_ID | A_Code | A_Cost |
+--------+--------+--------+
|      1 |     12 |      2 |
|      1 |     12 |      9 |
|      1 |     12 |      1 |
|      1 |     65 |     10 |
|      2 |     65 |      6 |
|      3 |     65 |      7 |
|      4 |     65 |      7 |
|      4 |     98 |      6 |
|      5 |     98 |      2 |
+--------+--------+--------+

table_D_dummies

+---+--------+--------+-----------+-----------+-----------+
|   | Geo_ID | A_Cost | A_Code_12 | A_Code_65 | A_Code_98 |
+---+--------+--------+-----------+-----------+-----------+
| 0 |      1 |      2 |         1 |         0 |         0 |
| 1 |      1 |      9 |         1 |         0 |         0 |
| 2 |      1 |      1 |         1 |         0 |         0 |
| 3 |      1 |     10 |         0 |         1 |         0 |
| 4 |      2 |      6 |         0 |         1 |         0 |
| 5 |      3 |      7 |         0 |         1 |         0 |
| 6 |      4 |      7 |         0 |         1 |         0 |
| 7 |      4 |      6 |         0 |         0 |         1 |
| 8 |      5 |      2 |         0 |         0 |         1 |
+---+--------+--------+-----------+-----------+-----------+

table_D_dummies_grouped

+--------+--------+-----------+-----------+-----------+
| Geo_ID | A_Cost | A_Code_12 | A_Code_65 | A_Code_98 |
+--------+--------+-----------+-----------+-----------+
|      1 |     22 |         3 |         1 |         0 |
|      2 |      6 |         0 |         1 |         0 |
|      3 |      7 |         0 |         1 |         0 |
|      4 |     13 |         0 |         1 |         1 |
|      5 |      2 |         0 |         0 |         1 |
+--------+--------+-----------+-----------+-----------+
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
PizzaAndCode
  • 340
  • 1
  • 3
  • 12
  • 2
    please share all data as text, no images please. Images cannot be copied and thus the issue cannot be replicated – anky Mar 17 '19 at 14:49
  • Good advice. [This](https://ozh.github.io/ascii-tables/) website generates ASCII tables for displaying tables in SO. – PizzaAndCode Mar 17 '19 at 15:10
  • 1
    generally speaking, people are more clear when you post a df as text so they can copy and an expected output df which abides your explaination(logic) check [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples), might be helpful in future. Cheers..! :) – anky Mar 17 '19 at 15:13
  • 1
    You beat me to it @anky_91. Always love a good formatting tip. – PizzaAndCode Mar 17 '19 at 15:27
  • No problem. I am glad you acknowledged. :) – anky Mar 17 '19 at 15:32

1 Answers1

3

You are not using your dummies table, you grouped the original dataframe:

table_D_dummies = pd.get_dummies(data = table_D, columns = ["A_Code"])
table_D_dummies_grouped = table_D.groupby(by = ["Geo_ID"]).sum()

You want to group table_D_dummies here:

>>> table_D_dummies
   Geo_ID  A_Cost  A_Code_12  A_Code_65  A_Code_98
0       1       2          1          0          0
1       1       9          1          0          0
2       1       1          1          0          0
3       1      10          0          1          0
4       2       6          0          1          0
5       3       7          0          1          0
6       4       7          0          1          0
7       4       6          0          0          1
8       5       2          0          0          1
>>> table_D_dummies.groupby(by = ["Geo_ID"]).sum()
        A_Cost  A_Code_12  A_Code_65  A_Code_98
Geo_ID
1           22          3          1          0
2            6          0          1          0
3            7          0          1          0
4           13          0          1          1
5            2          0          0          1

If you need to sum cost per dummy, add those to the grouping columns:

>>> table_D_dummies.groupby(by = [
...     "Geo_ID",
...     *(c for c in table_D_dummies.columns if c.startswith('A_Code_'))
... ]).sum()
                                      A_Cost
Geo_ID A_Code_12 A_Code_65 A_Code_98
1      0         1         0              10
       1         0         0              12
2      0         1         0               6
3      0         1         0               7
4      0         0         1               6
                 1         0               7
5      0         0         1               2
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Rats, I introduced that mistake while simplifying the code for sharing. This correction still doesn't result in the solution. Thank you! – PizzaAndCode Mar 17 '19 at 15:09
  • @PizzaAndCode: you haven't really provided an expected output, so we are guessing here what is expected. I added automated per-dummy groupings. – Martijn Pieters Mar 17 '19 at 15:18
  • You guessed right, that's the expected. Let me clean up the question for future readers' clarity. – PizzaAndCode Mar 17 '19 at 15:24