1

I am grouping by two columns in a Pandas DataFrame, after which I count the size of each group. This grouped DataFrame will then be filtered and the data plotted in a bar chart.

The issue i am having is that if a group has a zero count, it is not shown in the DataFrame and therefore does not appear on the plot. So the plot has missing categories on the x-axis when i would rather them include a category even if there is no bar to display (i.e represent the category as zero, thereby imo making the plot more representative of the data as a whole).

# Import the required packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Set the appearance of plots.
plt.style.use('ggplot')

# Create sample DataFrame.
data = {'ID':[1, 2, 3, 4, 5, 6, 7], 'Name':['Tom', 'Jack', 'Anne', 'Steve', 'Ricky', 'Jane', 'Beth'], 'Age':[28,34,29,42,15,10,26], 'Voted':[0, 1, 0, 1, 1, 0, 0]}
df = pd.DataFrame(data)

# Bin into age groups and create an Age Group column in the DataFrame.
bins = list(range(0, 60, 10))
df['Age Group'] = pd.cut(df['Age'], bins, right=False)

# Group data by Age Group and Voted columns. Then perform count using the ID column. Make Age Group the new index.
groups = df.groupby(['Age Group', 'Voted'])
new_df = groups.agg({'ID': 'count'}).rename(columns={'ID':'Count'})    
new_df.reset_index(inplace=True)
new_df.set_index('Age Group', inplace=True)
new_df

The code above will output this:

         Voted  ID
Age Group       
[10, 20)    0   1
[10, 20)    1   1
[20, 30)    0   3
[30, 40)    1   1
[40, 50)    1   1

What I'd like is for something like the result below, from which i can filter out just the voted=1 age groups and plot in a chart:

         Voted  ID
Age Group
[0, 10)     0   0
[0, 10)     1   0       
[10, 20)    0   1
[10, 20)    1   1
[20, 30)    0   3
[20, 30)    1   0
[30, 40)    0   0
[30, 40)    1   1
[40, 50)    0   0
[40, 50)    1   1  

I have searched the similar questions/results (most relative below), but i can't seem to get either to work.

[Pandas groupby for zero values [Pandas Groupby How to Show Zero Counts in DataFrame

I've also noticed that if i perform the count on just a single column, that zero groups do show up in the DataFrame. Why is this? eg:

# Group data by just Age Group column. Then perform count using the ID column.
groups = df.groupby(['Age Group'])
new_df = groups.agg({'ID': 'count'}).rename(columns={'ID':'Count'})
new_df # count displays the zero here for the 0-10 age group.

Any help in explaining what is going on here would be appreciated.

MatthewCarterIO
  • 195
  • 1
  • 1
  • 9

1 Answers1

0

Cause pd.cut will return categorical data. That is why you see the different between groupby two column with only categorical columns

Here is one way to fix the out put using reindex

new_df.reindex(pd.MultiIndex.from_product([np.unique(pd.cut(np.arange(50), bins, right=False)).tolist(),[0,1]]),fill_value=0)
Out[277]: 
            Count
[0, 10)  0      0
         1      0
[10, 20) 0      1
         1      1
[20, 30) 0      3
         1      0
[30, 40) 0      0
         1      1
[40, 50) 0      0
         1      1
BENY
  • 317,841
  • 20
  • 164
  • 234