When using boxplot from matplotlib.pyplot, the quartile values are calculated by including the median. Can this be changed to NOT include the median?

For example, consider the ordered data set

2, 3, 4, 5, 6, 7, 8

If the median is NOT included, then Q1=3 and Q3=7. However, boxplot includes the median value, i.e. 5, and generates the figure below

[Figure: Boxplot]

Is it possible to change this behavior and NOT include the median in the calculation of the quartiles? This should correspond to Method 1 as described on the Wikipedia page Quartile. The code to generate the figure is listed below

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

data = [2, 3, 4,  5,    6, 7, 8]

fig = plt.figure(figsize=(6,1))
ax = fig.add_axes([0.1,0.25,0.8,0.8])
bp = ax.boxplot(data, sym='',
                vert=False,
                positions=[0.4],
                widths=[0.3])

ax.set_xlim([0,9])
ax.set_ylim([0,1])

ax.xaxis.set_major_locator(MultipleLocator(1))

ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)

ax.yaxis.set_ticks([])

ax.grid(which='major',axis='x',lw=0.1)

plt.show()
John

Your way to calculate the quartiles isn't very standard. Which elements would you choose as quartiles for arrays 8, 9 or 10 elements long? With an even size, do you want the average of the two center elements as the median? Numpy's quantile function supports 5 types of calculations, but none would give your q1 and q3. "nearest" gives 4 and 6. "linear" is the only method supported by `ax.boxplot`. You could write your own function and provide these values to `ax.bxp`. – JohanC Mar 20 '21 at 23:14
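To make the "linear" method mentioned in the comment concrete, here is a minimal pure-Python sketch of linear interpolation between closest ranks, which is what NumPy's default percentile calculation does as far as I can tell. The function name `linear_quantile` is made up for this illustration and is not part of NumPy or matplotlib.

```python
import math


def linear_quantile(values, q):
    """Quantile by linear interpolation between closest ranks (0 <= q <= 1).

    Hypothetical helper for illustration only, not a NumPy/matplotlib function.
    """
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional 0-based index
    lo = math.floor(position)
    hi = math.ceil(position)
    fraction = position - lo
    # Interpolate between the two neighbouring elements
    return ordered[lo] + fraction * (ordered[hi] - ordered[lo])


data = [2, 3, 4, 5, 6, 7, 8]
q1 = linear_quantile(data, 0.25)
q3 = linear_quantile(data, 0.75)
print(q1, q3)  # 3.5 6.5
```

For the seven-element data set the positions are 1.5 and 4.5, so the box edges land halfway between 3 and 4 and between 6 and 7, rather than on 3 and 7 as Method 1 would give.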

1 Answer

The question is motivated by the fact that several educational resources on the internet calculate quartiles differently from the default used by matplotlib's boxplot. For example, the online course "Statistics and probability" from Khan Academy calculates the quartiles as described in Method 1 on the Wikipedia page Quartile, while boxplot uses linear interpolation, which for a data set with an odd number of elements gives the same values as Method 2.

Consider an example from Khan Academy's course "Statistics and probability", section "Comparing range and interquartile range (IQR)". The daily high temperatures in Paradise, MI are recorded for 7 days and found to be 16, 24, 26, 26, 26, 27, and 28 degrees Celsius. Describe the data with a boxplot and calculate the IQR.

The result obtained with the default settings in boxplot and the one presented by Prof. Khan are very different, see the figure below.

[Figure: Boxplots with quartiles calculated according to Method 1 and Method 2]
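The two sets of quartiles can be cross-checked without matplotlib using `statistics.quantiles` from the Python standard library (Python 3.8+). This is only a sketch: for this seven-element data set the "inclusive" method reproduces the boxplot defaults and the "exclusive" method reproduces the Method 1 values, but for data sets with an even number of elements "exclusive" can differ from Method 1, so it should not be treated as a general replacement.

```python
from statistics import quantiles

temperatures = [16, 24, 26, 26, 26, 27, 28]

# "inclusive" interpolates linearly between the data points (same values as the default boxplot here)
q1_inc, q2_inc, q3_inc = quantiles(temperatures, n=4, method="inclusive")

# "exclusive" agrees with Method 1 for this odd-length data set
q1_exc, q2_exc, q3_exc = quantiles(temperatures, n=4, method="exclusive")

iqr_inclusive = q3_inc - q1_inc
iqr_exclusive = q3_exc - q1_exc

print(f"inclusive: Q1={q1_inc}, Q3={q3_inc}, IQR={iqr_inclusive}")
print(f"exclusive: Q1={q1_exc}, Q3={q3_exc}, IQR={iqr_exclusive}")
```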

The IQR found by matplotlib is 1.5, and that calculated by Prof. Khan is 3. As pointed out in the comments by @JohanC, boxplot cannot be configured directly to follow Method 1; a custom function is required. Therefore, neglecting the calculation of outliers, I updated the code to calculate the quartiles according to Method 1, so that the result is comparable with the Khan Academy course. The code is listed below. It is not very pythonic, so suggestions are welcome.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook
from matplotlib.ticker import MultipleLocator


def median(x):
    """
    Return the median of an already sorted list of numbers x.

    For an odd number of elements the middle element is returned,
    e.g. [1, 2, 3, 4, 5] returns 3. For an even number of elements
    the arithmetic mean of the two middle elements is returned,
    e.g. [1, 2, 3, 4] returns 2.5.
    """
    n = len(x)
    if n & 1:
        # Odd number of elements in list, e.g. x = [1, 2, 3] returns 2
        return x[n // 2]
    # Even number of elements in list, e.g. x = [-1, 2] returns 0.5
    return (x[n // 2 - 1] + x[n // 2]) / 2


def method_1_quartiles(x):
    """
    Return (Q1, Q2, Q3) for a list of numbers x according to Method 1:
    for an odd number of elements the median is excluded from both halves.
    """
    x = sorted(x)  # sort a copy so the caller's list is not modified
    N = len(x)
    index_middle = N // 2
    lower = x[:index_middle]  # up to but not including the middle
    if N & 1:
        # Odd number of elements: skip the median itself
        upper = x[index_middle + 1:]
    else:
        # Even number of elements: split the list in half
        upper = x[index_middle:]

    Q1 = median(lower)
    Q2 = median(x)
    Q3 = median(upper)

    return Q1, Q2, Q3


data = [16, 24, 26, 26, 26, 27, 28]

fig = plt.figure(figsize=(6, 1))
ax = fig.add_axes([0.1, 0.25, 0.8, 0.8])

# Start from matplotlib's default statistics (linear interpolation)
stats = cbook.boxplot_stats(data)[0]

Q1_default = stats['q1']
Q3_default = stats['q3']
IQR_default = Q3_default - Q1_default

# Outliers are neglected: let the whiskers span the full data range
stats['whislo'] = min(data)
stats['whishi'] = max(data)

# Replace the default quartiles with the Method 1 values
Q1, Q2, Q3 = method_1_quartiles(data)
IQR = Q3 - Q1
stats['q1'] = Q1
stats['q3'] = Q3
print(f"IQR: {IQR}")

ax.bxp([stats], vert=False, manage_ticks=False, widths=[0.3], positions=[0.4], showfliers=False)

ax.set_xlim([15,30])
ax.set_ylim([0,1])

ax.xaxis.set_major_locator(MultipleLocator(1))

ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)

ax.yaxis.set_ticks([])

ax.grid(which='major',axis='x',lw=0.1)


plt.show()

The graph generated is

[Figure: Boxplot with quartiles according to Method 1]
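Since suggestions were asked for, one possible shorter variant of the Method 1 calculation relies on `statistics.median` from the standard library. This is a sketch rather than a drop-in replacement: the name `quartiles_excluding_median` is made up here, and the function assumes at least two data points. It also shows which values Method 1 yields for lists of 8, 9 and 10 elements, the case raised in the comment.

```python
from statistics import median


def quartiles_excluding_median(values):
    """Return (Q1, Q2, Q3); the median is left out of both halves when n is odd.

    Hypothetical helper name; assumes len(values) >= 2.
    """
    ordered = sorted(values)               # work on a copy of the input
    half = len(ordered) // 2
    lower = ordered[:half]                 # elements below the median position
    upper = ordered[len(ordered) - half:]  # elements above the median position
    return median(lower), median(ordered), median(upper)


print(quartiles_excluding_median([16, 24, 26, 26, 26, 27, 28]))  # (24, 26, 27)

# Lists with 8, 9 and 10 elements
for n in (8, 9, 10):
    print(n, quartiles_excluding_median(range(1, n + 1)))
```

The returned Q1 and Q3 can be written into the stats dictionary in the same way as in the listing above before calling `ax.bxp`.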

John