1

I would like to plot boxplots for several datasets based on a criterion. Imagine a dataframe similar to the example below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Group':[1,1,1,2,3,2,2,3,1,3],'M':np.random.rand(10),'F':np.random.rand(10)})
df = df[['Group','M','F']]

   Group         M         F
0      1  0.465636  0.537723
1      1  0.560537  0.727238
2      1  0.268154  0.648927
3      2  0.722644  0.115550
4      3  0.586346  0.042896
5      2  0.562881  0.369686
6      2  0.395236  0.672477
7      3  0.577949  0.358801
8      1  0.764069  0.642724
9      3  0.731076  0.302369

In this case, I have three groups, so I would like to make a boxplot for each group, with M and F plotted separately, the groups on the Y axis, and the M and F boxes colour-coded. This answer is very close to what I want to achieve, but I would prefer something more robust that works for larger dataframes with a greater number of groups. I feel that groupby is the way to go, but I am not familiar with groupby objects and I cannot even slice them. The desired output would look something like this: [image]

It looks like someone had the same problem years ago but got no answers :( See "Having a boxplot as a graphical representation of the describe function of groupby".

My questions are:

  1. How do I use groupby to feed the desired data into the boxplot?
  2. What is the correct syntax for the boxplot if I want to control what is displayed rather than rely on the defaults (I don't even know what those are, and I find the documentation rather vague)? To be specific, can I have the box cover the mean +/- standard deviation and keep the vertical line at the median value?
durbachit
  • Did you try some code? what kind of problems / errors did you get? – Ruthger Righart May 24 '17 at 07:57
  • import matplotlib.pyplot as plt and then df.boxplot(['M','F'],'Group') – jarry jafery May 24 '17 at 08:18
  • this will generate 2 separate plots for male and female and on the basis of groups. – jarry jafery May 24 '17 at 08:19
  • As you said, this generates separate subplots, it does not plot them together. Plus it does not address the point number 2. But thanks, for a simpler case it is good to know how easily it can be done. – durbachit May 24 '17 at 08:29
  • Please try df.boxplot(by='Group',vert=False); it will give you the quartiles on your x axis. It would be difficult to get all the variables in a single plot, since we are applying the groupby operation at the same time, but we can get multiple plots, one per variable, grouped by the grouping variable. – jarry jafery May 24 '17 at 09:30
  • you can read the documentation https://matplotlib.org/examples/pylab_examples/boxplot_demo.html and https://matplotlib.org/examples/pylab_examples/boxplot_demo2.html . – jarry jafery May 24 '17 at 09:32
  • @jarryjafery There are two problems here: 1) you are mixing the matplotlib boxplot with the pandas boxplot (the documentation you link is for the matplotlib boxplot, which I am more familiar with than the pandas one used in your example). 2) I can't figure out how to get the required statistical settings for the boxplot, as stated in the question: "the box covering the mean +/- standard deviation, and keep the vertical line at median value", with whiskers covering the full range of the data (in matplotlib, this can be achieved with `whis='range'`). – durbachit May 24 '17 at 23:26
  • Oh I see. I apologise. The reason why it doesn't work for my real data is that it contains NaN, not that the keywords from matplotlib don't work. – durbachit May 25 '17 at 02:29

2 Answers

3

I think you should use the Seaborn library, which makes it easy to create this type of customized plot. In your case I first melted your dataframe to convert it into the proper (long) format and then created the boxplot you wanted.

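import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# reshape to long format: one row per (Group, sex, value)
dd = pd.melt(df, id_vars=['Group'], value_vars=['M','F'], var_name='sex')
# horizontal boxes: groups on the y axis, M and F distinguished by hue
sns.boxplot(y='Group', x='value', data=dd, orient="h", hue='sex')
plt.show()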

The plot looks similar to your required plot. [image]
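To make the reshaping step above more concrete, here is a minimal sketch of what pd.melt produces for a dataframe shaped like the one in the question. The seeded random generator and the `summary` name are assumptions added for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same layout as the question's dataframe; the seed is only for reproducibility.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Group": [1, 1, 1, 2, 3, 2, 2, 3, 1, 3],
    "M": rng.random(10),
    "F": rng.random(10),
})

# Wide -> long: every (row, column) pair becomes one row with columns
# 'Group', 'sex' (the former column name) and 'value'.
dd = pd.melt(df, id_vars=["Group"], value_vars=["M", "F"], var_name="sex")
print(dd.head())

# The long table can also be summarised per group and sex, which is
# roughly the numeric counterpart of the grouped boxplot.
summary = dd.groupby(["Group", "sex"])["value"].describe()
print(summary)
```

Each (Group, sex) pair in `summary` corresponds to one box in the seaborn plot, so the same call should scale to more groups or more value columns without changes.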

jarry jafery
1

Finally, I found a solution by slightly modifying this answer. It does not use a groupby object, so preparing the data is more tedious, but so far it looks like the best solution to me. Here it is:

# imports (df is the dataframe from the question)
import numpy as np
import matplotlib.pyplot as plt

# here I prepare the data (group them manually and then store in lists)

Groups=[1,2,3]
Columns=df.columns.tolist()[1:]
print(Columns)
Mgroups=[]
Fgroups=[]

for g in Groups:
    dfgc = df[df['Group']==g]
    # drop NaN values, which would otherwise break the boxplot statistics
    m=dfgc['M'].dropna()
    f=dfgc['F'].dropna()
    Mgroups.append(m.tolist())
    Fgroups.append(f.tolist())

fig=plt.figure()
ax = plt.axes()
def setBoxColors(bp,cl):
    plt.setp(bp['boxes'], color=cl, linewidth=2.)
    plt.setp(bp['whiskers'], color=cl, linewidth=2.5)
    plt.setp(bp['caps'], color=cl,linewidth=2)
    plt.setp(bp['medians'], color=cl, linewidth=3.5)

# whis=(0, 100) extends the whiskers over the full data range
# (older Matplotlib versions spelled this whis='range')
bpl = plt.boxplot(Mgroups, positions=np.arange(len(Mgroups))*3.0-0.4,vert=False,whis=(0, 100), sym='', widths=0.6)
bpr = plt.boxplot(Fgroups, positions=np.arange(len(Fgroups))*3.0+0.4,vert=False,whis=(0, 100), sym='', widths=0.6)
setBoxColors(bpr, '#D7191C') # colors are from http://colorbrewer2.org/
setBoxColors(bpl, '#2C7BB6')

# draw temporary red and blue lines and use them to create a legend
plt.plot([], c='#D7191C', label='F')
plt.plot([], c='#2C7BB6', label='M')
plt.legend()

plt.yticks(range(0, len(Groups) * 3, 3), Groups)
plt.ylim(-3, len(Groups)*3)
#plt.xlim(0, 8)
plt.show()

Resulting plot.

The result looks mostly like what I wanted. As far as I have been able to find, the box drawn by plt.boxplot always spans the first to third quartile, so it cannot be set to +/- standard deviation there. I am a bit disappointed that there is no one-line solution, but I am glad it is possible. However, for hundreds of groups this would not be good enough...
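One possible way around both limitations, offered as a sketch rather than a tested solution for the real data: Matplotlib's Axes.bxp draws boxes from precomputed statistics, so the 'q1' and 'q3' entries can be filled with mean - std and mean + std, and groupby can build those statistics for any number of groups. The helper names `mean_std_stats` and `grouped_mean_std_boxplot`, and the `spacing` and `width` parameters, are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def mean_std_stats(df, by, column):
    """Return one bxp-style stats dict per group (hypothetical helper)."""
    stats = []
    # groupby yields (key, sub-frame) pairs, sorted by key by default
    for key, sub in df.groupby(by):
        values = sub[column].dropna()
        mean, std = values.mean(), values.std()  # pandas std uses ddof=1
        stats.append({
            "label": str(key),
            "med": values.median(),    # line inside the box
            "q1": mean - std,          # box edges: mean +/- std, not quartiles
            "q3": mean + std,
            "whislo": values.min(),    # whiskers span the full data range
            "whishi": values.max(),
            "fliers": [],
        })
    return stats


def grouped_mean_std_boxplot(ax, df, by, columns, colors, spacing=3.0, width=0.6):
    """Draw horizontal mean +/- std boxes, one colour per column."""
    labels = [s["label"] for s in mean_std_stats(df, by, columns[0])]
    centers = np.arange(len(labels)) * spacing
    # spread the columns symmetrically around each group's centre
    offsets = (np.arange(len(columns)) - (len(columns) - 1) / 2) * (width + 0.2)
    for column, color, offset in zip(columns, colors, offsets):
        props = {"color": color, "linewidth": 2}
        ax.bxp(mean_std_stats(df, by, column), positions=centers + offset,
               widths=width, vert=False, showfliers=False, manage_ticks=False,
               boxprops=props, whiskerprops=props, capprops=props,
               medianprops=props)
        ax.plot([], color=color, label=column)  # empty line as legend proxy
    ax.set_yticks(centers)
    ax.set_yticklabels(labels)
    ax.legend()
```

Usage would be something like `fig, ax = plt.subplots()`, then `grouped_mean_std_boxplot(ax, df, 'Group', ['M', 'F'], ['#2C7BB6', '#D7191C'])`, then `plt.show()`. Note that with these statistics the median line or a whisker end can fall outside the box for skewed data, and a group with a single value has an undefined standard deviation.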

durbachit