This question is similar to this one, but with a crucial difference: the solution to the linked question does not solve the issue when the dataframe is grouped into bins.
The following code, which tries to boxplot the distribution of the binned preTestScore counts for each regiment, raises an error:
import pandas as pd
import seaborn as sns
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
sns.boxplot(x='regiment', y='preTestScore', data=df1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-241-fc8036eb7d0b> in <module>()
----> 1 sns.boxplot(x='regiment', y='preTestScore', data=df1)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth, whis, notch, ax, **kwargs)
2209 plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
2210 orient, color, palette, saturation,
-> 2211 width, dodge, fliersize, linewidth)
2212
2213 if ax is None:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth)
439 width, dodge, fliersize, linewidth):
440
--> 441 self.establish_variables(x, y, hue, data, orient, order, hue_order)
442 self.establish_colors(color, palette, saturation)
443
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
149 if isinstance(input, string_types):
150 err = "Could not interpret input '{}'".format(input)
--> 151 raise ValueError(err)
152
153 # Figure out the plotting orientation
ValueError: Could not interpret input 'regiment'
If I remove the x and y parameters, it produces a boxplot, but it's not the one I want:
How do I fix this? I tried the following:
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
df1 = df1.reset_index()
df1
It now looks like a dataframe, so I thought of extracting the column names of this dataframe and plotting for each one sequentially:
cols = df1.columns[1:]
for col in cols:
    sns.boxplot(x='regiment', y=col, data=df1)
This doesn't look right. In fact, this is not a normal dataframe; if we print out its columns, it does not show regiment as a column, which is why boxplot gives the error ValueError: Could not interpret input 'regiment':
df1.columns
>>> Index(['regiment', 2, 3, 4, 24, 31], dtype='object', name='preTestScore')
So, if I could just somehow make regiment a column of the dataframe, I think I should be able to plot the boxplot of preTestScore vs regiment. Am I wrong?
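To narrow this down, here is a minimal check on the same scores. It is only a sketch: the names counts and flat are introduced just for this illustration, and the data is cut down to the two columns involved. It tests whether regiment is among the columns before and after reset_index, and what the name= shown in the Index output refers to.

```python
import pandas as pd

# Same regiments and pre-test scores as above, reduced to two columns
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Count how often each score occurs per regiment, one column per score
counts = df.groupby('regiment')['preTestScore'].value_counts().unstack().fillna(0)

# Before reset_index, 'regiment' is the row index, not a column
print('regiment' in counts.columns)  # False

# After reset_index, 'regiment' is an ordinary column
flat = counts.reset_index()
print('regiment' in flat.columns)    # True

# The name= in the printed Index is the label of the column axis
print(flat.columns.name)             # preTestScore
```

If this check is representative, the ValueError comes from the dataframe before reset_index, and the name='preTestScore' in the output is only a label on the column axis left over from unstack.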
EDIT: What I want is something like this:
df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
# This df2 dataframe is the one I'm trying to construct using groupby
data = {'regiment':['Dragoons', 'Nighthawks', 'Scouts'], 'preTestScore 2':[0.0, 1.0, 2.0], 'preTestScore 3':[1.0, 0.0, 2.0],
'preTestScore 4':[1.0, 1.0, 0.0], 'preTestScore 24':[1.0, 1.0, 0.0], 'preTestScore 31':[1.0, 1.0, 0.0]}
cols = ['regiment', 'preTestScore 2', 'preTestScore 3', 'preTestScore 4', 'preTestScore 24', 'preTestScore 31']
df2 = pd.DataFrame(data, columns=cols)
df2
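import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20, 3))
# One subplot per preTestScore bin, laid out side by side
for count, col in enumerate(cols[1:], start=1):
    plt.subplot(1, len(cols) - 1, count)
    sns.boxplot(x='regiment', y=col, data=df2)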
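One possible way to get from the grouped counts to a frame shaped like df2, sketched under the assumption that prefixing the integer score labels and resetting the index is all that is needed. The name df2_built is hypothetical and used only here; add_prefix and reset_index are standard pandas DataFrame methods.

```python
import pandas as pd

# Same regiments and pre-test scores as above, reduced to two columns
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

counts = df.groupby('regiment')['preTestScore'].value_counts().unstack().fillna(0)

# Prefix the integer score labels so every column name is a string,
# then move 'regiment' out of the index into an ordinary column
df2_built = counts.add_prefix('preTestScore ').reset_index()
df2_built.columns.name = None  # drop the leftover 'preTestScore' axis label

print(df2_built.columns.tolist())
# ['regiment', 'preTestScore 2', 'preTestScore 3', 'preTestScore 4', 'preTestScore 24', 'preTestScore 31']
```

If that holds, df2_built could stand in for the hand-written df2 in the subplot loop, with cols taken from df2_built.columns.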