1

I have a dataframe with a variable of interest (categorical, here Yes, No, etc.) and a grouping variable (see below):

import pandas as pd
import numpy as np
df = pd.DataFrame({ 
    'ID': range(100),
    'group': np.random.choice(['A', 'B', 'C'], 100),
    'Response':np.random.choice(['Yes','No','Other', np.nan], 100)})

From this, I would like to retrieve and plot the accumulated data per group in a bar plot.

In detail: for group A, the percentage of Yes, No, etc., the same for group by. The command df['Response'].groupby(df['group']).value_counts() already gives me this output:

group  Response
A      Other       14
       No           8
       Yes          8
       nan          8
B      Other       11
       nan         11
       No           5
       Yes          4
C      No           9
       Yes          9
       nan          7
       Other        6
Name: Response, dtype: int64

This is what I want but I can't find a way to plot it appropriately (in matplotlib or seaborn) and am unsure if this is an issue of data transformation or visualization.

This question is asking about something similar but I can't get it to work with unstack:

df = df['group'].unstack(0, fill_value = 0)

gives

AttributeError: 'RangeIndex' object has no attribute 'remove_unused_levels'

and

df = df['group'].unstack(0, fill_value = 0)
df.index.name = None
df.columns.name = None
df.plot.bar(stacked=True)

only plots the ID (ungrouped).

rpanai
  • 12,515
  • 2
  • 42
  • 64
Ivo
  • 3,890
  • 5
  • 22
  • 53

2 Answers2

3

You should do unstack without select the columns , the groupby output is Series, and notice you are using Series groupby not pd.DataFrame.groupby

df['Response'].groupby(df['group']).value_counts().unstack(fill_value=0)

type(df['Response'].groupby(df['group']).value_counts())
Out[207]: pandas.core.series.Series
BENY
  • 317,841
  • 20
  • 164
  • 234
  • excellent, thank you. Using `df.plot.bar(stacked=True)`, I get the desired plot. How do I scale from absolute numbers to percentages in this plot (the groups are of different sizes)? – Ivo Mar 02 '19 at 15:29
  • @Ivo yw :-) happy coding – BENY Mar 02 '19 at 15:35
  • @Ivo pass normalized in value_counts as True – BENY Mar 02 '19 at 17:43
1

I think better is use grouping by column in string and get column after groupby for processing:

df1 = df.groupby('group')['Response'].value_counts().unstack(fill_value=0)

Alternative is use crosstab:

df1 = pd.crosstab(df['group'], df['Response'])

For remove index and columns names is possible use your solution or DataFrame.rename_axis:

#pandas 0.24+
df1.rename_axis(index=None, columns=None).plot.bar(stacked=True)

#pandas bellow
#df1.rename_axis(None).rename_axis(None, axis=1).plot.bar(stacked=True)

Your solution is mainly syntactic sugar for the alternative and much more verbose, obviously is using if need process column before, e.g need lowercase of column Response:

df1 = df['Response'].str.lower().groupby(df['group']).value_counts().unstack(fill_value=0)

First solution is possible, but need 2 rows:

df['Response'] = df['Response'].str.lower()
df1 = df.groupby('group')['Response'].value_counts().unstack(fill_value=0)

EDIT:

For scaling use normalize=True in Series.value_counts:

.value_counts(normalize=True)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • excellent, thank you. Using `df1.plot.bar(stacked=True)`, I get the desired plot. How do I scale from absolute numbers to percentages in this plot (the groups are of different sizes)? – Ivo Mar 02 '19 at 15:29
  • @Ivo - For scaling change `value_counts()` to `value_counts(normalize=True)` – jezrael Mar 02 '19 at 15:31