I'm learning pandas
, with a strong SQL background, so I need to rethink many habits and frames of mind. While I think I understand the groupby()
method, I just can't figure out how to apply it over multiple columns.
Let's say we have this table in a database:
+----+--------------+-----------+--------------+-------+
| id | product_name | category | subcategory | price |
+----+--------------+-----------+--------------+-------+
| 1 | product1 | category1 | subcategory1 | 8.41 |
| 2 | product2 | category1 | subcategory1 | 62.74 |
| 3 | product3 | category1 | subcategory2 | 85.84 |
| 4 | product4 | category2 | subcategory2 | 32.71 |
| 5 | product5 | category2 | subcategory1 | 39.62 |
| 6 | product6 | category2 | subcategory1 | 37.43 |
| 7 | product7 | category3 | subcategory2 | 55.01 |
| 8 | product8 | category3 | subcategory1 | 26.91 |
| 9 | product9 | category3 | subcategory3 | 77.13 |
| 10 | product10 | category3 | subcategory3 | 40.79 |
+---+--------------+-----------+--------------+-------+
It's very easy to do an aggregate on multiple columns:
select category, subcategory, avg(price) as avg_price from my_table group by category, subcategory
which returns this:
+-----------+--------------+-----------+
| category | subcategory | avg_price |
+-----------+--------------+-----------+
| category1 | subcategory1 | 35.575 |
| category1 | subcategory2 | 85.84 |
| category2 | subcategory1 | 38.525 |
| category2 | subcategory2 | 32.71 |
| category3 | subcategory1 | 26.91 |
| category3 | subcategory2 | 55.01 |
| category3 | subcategory3 | 58.96 |
+-----------+--------------+-----------+
So, in my obviously incorrect understanding, this would have done the same in pandas:
df['price'].groupby(df[['category', 'subcategory']]).mean()
which returns ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
, while:
df['price'].groupby(df['category']).mean()
works as expected.
Could someone help me?