How do I index by a column and aggregate with another in Pandas?

How do I index by a column and aggregate with another in Pandas?

2 Answers2

Question

I have the following info in a dataframe in python:

From that dataframe, I require another dataframe that summarizes the information of all the students per state per age.

In other words, instead of having one entry per student (with their age and state) I'd like to have a data frame with the rows as the states and the columns as the ages.

Initial Data Frame

Student	State	Age
1	California	13
2	California	14
3	Colorado	12
4	Colorado	11
5	Colorado	12
6	Colorado	10
7	Colorado	13
8	Colorado	12
9	Colorado	13
10	Colorado	11
11	Florida	10
12	Florida	11
13	Texas	11
14	Texas	9
15	Texas	12
16	Texas	10

This is what I am expecting

Expected Result

And I need it through iterations because In reality I have thousands of rows and dozens of different variables

Do you know what I can do to achieve this?

sounds like a job for groupby and count https://stackoverflow.com/questions/19384532/get-statistics-for-each-group-such-as-count-mean-etc-using-pandas-groupby — Michael, Jan 30 '23 at 00:51
`df.pivot_table(index='State', columns = 'Age', values = 'Age', aggfunc = len)` will get you close to what you want. — njp, Jan 30 '23 at 01:08
What you're looking for is a crosstab operation. Try `df.groupby(['State', 'Age']).size().unstack(fill_value=0).reset_index()`. See [this answer](https://stackoverflow.com/a/73060100/19123103) for explanation and [this answer](https://stackoverflow.com/a/72933069/19123103) on why `groupby` works. — cottontail, Jan 30 '23 at 01:24

Luciano · Answer 1 · 2023-01-30T01:23:53.840

2

You want to group the students per State, showing the number of students of a specific age. This seems like an ideal use for a pivot table.

Here's the pandas implementation (Note: I replace all NaN values with 0s).

df.pivot_table(index='State', columns = 'Age', values = 'Age', aggfunc='count').fillna(0)

edited Jan 30 '23 at 01:23

answered Jan 30 '23 at 01:15

Luciano

354
1
7

score 1 · Answer 2 · answered Jan 30 '23 at 12:36

The crosstab function from pandas automatically aggregates the data and produces the expected result too.

pd.crosstab(index=df['State'], columns=df['Age'], dropna=True)

Age         9   10  11  12  13  14
State
----------------------------------                             
California   0   0   0   0   1   1
Colorado     0   1   2   3   2   0
Florida      0   1   1   0   0   0
Texas        1   1   1   1   0   0

California