This question is similar to this one, but with a crucial difference - the solution to the linked question does not solve the issue when the dataframe is grouped into bins.

The following code, meant to boxplot the distribution of the binned counts for the two variables, produces an error:

import pandas as pd
import seaborn as sns

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])


df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)


sns.boxplot(x='regiment', y='preTestScore', data=df1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-241-fc8036eb7d0b> in <module>()
----> 1 sns.boxplot(x='regiment', y='preTestScore', data=df1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in boxplot(x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth, whis, notch, ax, **kwargs)
   2209     plotter = _BoxPlotter(x, y, hue, data, order, hue_order,
   2210                           orient, color, palette, saturation,
-> 2211                           width, dodge, fliersize, linewidth)
   2212 
   2213     if ax is None:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in __init__(self, x, y, hue, data, order, hue_order, orient, color, palette, saturation, width, dodge, fliersize, linewidth)
    439                  width, dodge, fliersize, linewidth):
    440 
--> 441         self.establish_variables(x, y, hue, data, orient, order, hue_order)
    442         self.establish_colors(color, palette, saturation)
    443 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\categorical.py in establish_variables(self, x, y, hue, data, orient, order, hue_order, units)
    149                 if isinstance(input, string_types):
    150                     err = "Could not interpret input '{}'".format(input)
--> 151                     raise ValueError(err)
    152 
    153             # Figure out the plotting orientation

ValueError: Could not interpret input 'regiment'

If I remove the x and y parameters, it produces a boxplot, but it's not the one I want.

How do I fix this? I tried the following:

df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)
df1 = df1.reset_index()
df1

It now looks like a dataframe, so I thought of extracting the column names of this dataframe and plotting for each one sequentially:

cols = df1.columns[1:]
for col in cols:
    sns.boxplot(x='regiment', y=col, data=df1)

This doesn't look right. In fact, this is not a normal dataframe; if we print out its columns, it does not show regiment as a column, which is why boxplot gives the error ValueError: Could not interpret input 'regiment':

df1.columns
>>> Index(['regiment', 2, 3, 4, 24, 31], dtype='object', name='preTestScore')

So, if I could just somehow make regiment a column of the dataframe, I think I should be able to plot the boxplot of preTestScore vs regiment. Am I wrong?


EDIT: What I want is something like this:

df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)

# This df2 dataframe is the one I'm trying to construct using groupby
data = {'regiment':['Dragoons', 'Nighthawks', 'Scouts'], 'preTestScore 2':[0.0, 1.0, 2.0], 'preTestScore 3':[1.0, 0.0, 2.0],
        'preTestScore 4':[1.0, 1.0, 0.0], 'preTestScore 24':[1.0, 1.0, 0.0], 'preTestScore 31':[1.0, 1.0, 0.0]}

cols = ['regiment', 'preTestScore 2', 'preTestScore 3', 'preTestScore 4', 'preTestScore 24', 'preTestScore 31']

df2 = pd.DataFrame(data, columns=cols)
df2

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20,3))

# One subplot per count column, laid out in a single row
for count, col in enumerate(cols[1:], start=1):
    plt.subplot(1, len(cols)-1, count)
    sns.boxplot(x='regiment', y=col, data=df2)

Kristada673
  • `value_counts` counts how many times each unique value occurs (in this case, each preTestScore within each group). What do you want the boxplot to look like in the end? – Shaido Aug 15 '18 at 02:23
  • `sns.boxplot` groups data by itself, if you just do `sns.boxplot(x='regiment', y='preTestScore', data=df)`, will it be the desired result? – Teoretic Aug 15 '18 at 02:28
  • @Teoretic I've edited the question to show what I'm looking for. – Kristada673 Aug 15 '18 at 05:55

1 Answer

Calling reset_index() on your dataframe df1 should give you the dataframe you want.

The problem was that one of the columns you need (regiment) was the index, so you have to reset the index to turn it into a regular column.
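
As a rough sketch of that point, the snippet below checks where regiment lives before and after reset_index(). It uses only pandas and the two relevant columns from the question; the names counts and flat are placeholders chosen for this illustration, not names from the original code.

```python
import pandas as pd

# Same scores as in the question, reduced to the two columns that matter here.
df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# 'counts' is an illustrative name for the table called df1 in the question.
counts = df.groupby('regiment')['preTestScore'].value_counts().unstack().fillna(0)

# Before reset_index(): 'regiment' labels the rows, so x='regiment' cannot be resolved.
print(counts.index.name)              # regiment
print('regiment' in counts.columns)   # False

# After reset_index(): 'regiment' is an ordinary column.
flat = counts.reset_index()
print('regiment' in flat.columns)     # True

# name='preTestScore' in the Index repr is the label of the columns axis,
# not a column; the score columns themselves are keyed by integers.
print(flat.columns.name)              # preTestScore
print(flat[2].tolist())               # [0.0, 1.0, 2.0]
```

Because the score columns are keyed by integers at this stage, they have to be addressed as flat[2] rather than flat['2'] until they are renamed.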

Edit: added add_prefix for proper column names in the resulting dataframe

Sample code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])


df1 = df.groupby(['regiment'])['preTestScore'].value_counts().unstack()
df1.fillna(0, inplace=True)

df1 = df1.add_prefix('preTestScore ')  # <- add_prefix for proper column names

df2 = df1.reset_index()  # <- Here is reset_index()
cols = df2.columns

fig = plt.figure(figsize=(20,3))

# One subplot per count column, laid out in a single row
for count, col in enumerate(cols[1:], start=1):
    plt.subplot(1, len(cols)-1, count)
    sns.boxplot(x='regiment', y=col, data=df2)

Output: a single row of boxplot panels, one per preTestScore column, with regiment on the x-axis.
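
To check that this chain really reproduces the hand-built df2 from the question's EDIT, a small comparison along these lines may help. It is a sketch using pandas only; built and expected are names invented here, and the rename_axis call is an extra step (not in the sample code above) that drops the leftover 'preTestScore' label on the columns axis.

```python
import pandas as pd

df = pd.DataFrame({
    'regiment': ['Nighthawks'] * 4 + ['Dragoons'] * 4 + ['Scouts'] * 4,
    'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
})

# Same steps as the sample code, written as one chain.
built = (
    df.groupby('regiment')['preTestScore']
      .value_counts()
      .unstack()
      .fillna(0)
      .add_prefix('preTestScore ')   # 2 -> 'preTestScore 2', and so on
      .reset_index()                 # move 'regiment' from the index to a column
      .rename_axis(columns=None)     # assumption: tidy up the columns-axis label
)

# The frame the question builds by hand in its EDIT section.
expected = pd.DataFrame({
    'regiment': ['Dragoons', 'Nighthawks', 'Scouts'],
    'preTestScore 2': [0.0, 1.0, 2.0],
    'preTestScore 3': [1.0, 0.0, 2.0],
    'preTestScore 4': [1.0, 1.0, 0.0],
    'preTestScore 24': [1.0, 1.0, 0.0],
    'preTestScore 31': [1.0, 1.0, 0.0],
})

# Compare labels and values of the two frames.
print(built.equals(expected))
```

Each column holds only three values (one per regiment), so every box in the resulting plot summarises a single count rather than a distribution.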

Teoretic
  • I had done this `reset_index()`, I've mentioned it in the question. The code after that point hadn't worked. – Kristada673 Aug 15 '18 at 07:17
  • @Kristada673 if you try to copy-paste and execute my code, doesn't it produce the same output as in _What I want is something like this_ section? – Teoretic Aug 15 '18 at 07:28
  • @Kristada673 well, it doesn't set "preTestScore " prefixes in the column names; do you want to add those prefixes (e.g. rename column "2" to "preTestScore 2")? – Teoretic Aug 15 '18 at 07:31
  • Yes, it does. Yes, it would be nice to append the main column name to the y-values. – Kristada673 Aug 15 '18 at 07:39