0

I have the following info in a dataframe in python:

From that dataframe, I require another dataframe that summarizes the information of all the students per state per age.

In other words, instead of having one entry per student (with their age and state) I'd like to have a data frame with the rows as the states and the columns as the ages.

Initial Data Frame

Student State Age
1 California 13
2 California 14
3 Colorado 12
4 Colorado 11
5 Colorado 12
6 Colorado 10
7 Colorado 13
8 Colorado 12
9 Colorado 13
10 Colorado 11
11 Florida 10
12 Florida 11
13 Texas 11
14 Texas 9
15 Texas 12
16 Texas 10

This is what I am expecting

Expected Result

State 9 10 11 12 13 14
California 0 0 0 0 1 1
Colorado 0 1 2 3 2 0
Florida 0 1 1 0 0 0
Texas 1 1 1 1 0 0

And I need it through iterations because In reality I have thousands of rows and dozens of different variables

Do you know what I can do to achieve this?

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • sounds like a job for groupby and count https://stackoverflow.com/questions/19384532/get-statistics-for-each-group-such-as-count-mean-etc-using-pandas-groupby – Michael Jan 30 '23 at 00:51
  • 1
    `df.pivot_table(index='State', columns = 'Age', values = 'Age', aggfunc = len)` will get you close to what you want. – njp Jan 30 '23 at 01:08
  • 1
    What you're looking for is a crosstab operation. Try `df.groupby(['State', 'Age']).size().unstack(fill_value=0).reset_index()`. See [this answer](https://stackoverflow.com/a/73060100/19123103) for explanation and [this answer](https://stackoverflow.com/a/72933069/19123103) on why `groupby` works. – cottontail Jan 30 '23 at 01:24

2 Answers2

2

You want to group the students per State, showing the number of students of a specific age. This seems like an ideal use for a pivot table.

Here's the pandas implementation (Note: I replace all NaN values with 0s).

df.pivot_table(index='State', columns = 'Age', values = 'Age', aggfunc='count').fillna(0)
Luciano
  • 354
  • 1
  • 7
1

The crosstab function from pandas automatically aggregates the data and produces the expected result too.

pd.crosstab(index=df['State'], columns=df['Age'], dropna=True)
Age         9   10  11  12  13  14
State
----------------------------------                             
California   0   0   0   0   1   1
Colorado     0   1   2   3   2   0
Florida      0   1   1   0   0   0
Texas        1   1   1   1   0   0
AboAmmar
  • 5,439
  • 2
  • 13
  • 24