
While reading the answers to a related question on Stack Overflow, I saw the code matplotlib uses to compute the whisker positions and detect outliers:

    # get high extreme
    iq = q3 - q1
    hi_val = q3 + whis * iq
    wisk_hi = np.compress(d <= hi_val, d)
    if len(wisk_hi) == 0 or np.max(wisk_hi) < q3:
        wisk_hi = q3
    else:
        wisk_hi = max(wisk_hi)

Now, the else part makes perfect sense: as per the specification of Tukey boxplots, we find the highest datum within 1.5 IQR of the upper quartile. Indeed, that is max(wisk_hi), the largest data entry that is no greater than Q3+1.5*IQR.

The or condition, however, I don't understand. Its first part, len(wisk_hi) == 0, translates to...

if we find no elements below the `hi_val` ...

How can this condition ever hold? Q3 is found by splitting the data at the median and taking the median of the upper half; hi_val then adds 1.5*IQR on top of that. How can there NOT be data at or below this value?

If this is about an empty dataset, then the second part of the or doesn't make sense either (since neither Q3 nor the IQR makes sense without data).

Probably missing something obvious - help?

ttsiodras

2 Answers


The interquartile range can be biased. "The upper adjacent value can be less than Q3, which forces the whisker to be drawn from Q3 into the box. The lower adjacent value can also be greater than Q1, which forces the whisker to be drawn from Q1 into the box." (source)

IQR = Q3 - Q1

Lower limit: Q1 - 1.5 (Q3 - Q1)

Upper limit: Q3 + 1.5 (Q3 - Q1)

Check out the data in the link.
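
As a rough sketch of the two limits above (not matplotlib's own code), they can be computed with the standard library. The function name `tukey_fences` and the sample data are made up for illustration, and the "inclusive" quartile method is an assumption: other quartile conventions give different Q1 and Q3, and therefore different limits.

```python
from statistics import quantiles


def tukey_fences(data, whis=1.5):
    """Return (lower_limit, upper_limit) = (Q1 - whis*IQR, Q3 + whis*IQR).

    Hypothetical helper for illustration. Quartiles use the "inclusive"
    method (linear interpolation between data points), which is an
    assumption about the quartile convention.
    """
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr


# Illustrative data only: one value far above the rest.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = tukey_fences(sample)
print(lower, upper)  # -3.0 13.0

# Points outside the limits are the ones a boxplot would draw as fliers.
print([x for x in sample if x < lower or x > upper])  # [100]
```

Note that the limits themselves are usually not data points; the whiskers are drawn to the most extreme data values that still fall inside them.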

noumenal
  • Thanks! Now that I see the data, I understand. – ttsiodras May 30 '16 at 11:32
  • To avoid external dependencies for future readers of this answer: see what happens at the lower whisker with this dataset: [1200, 1443, 1490, 1528, 1563, 2479] – ttsiodras May 30 '16 at 11:44

The example output below (derived from matplotlib test data, in fact) shows the problem:

$ ipython2
Python 2.7.11 (default, Mar 31 2016, 06:18:34) 
IPython 4.2.0 -- An enhanced Interactive Python.

In [1]: import numpy as np

In [2]: import matplotlib

In [3]: a=[3, 9000, 150, 88, 350, 200000, 1400, 960]

In [4]: sa=list(sorted(a))

In [5]: sa
Out[5]: [3, 88, 150, 350, 960, 1400, 9000, 200000]

In [6]: globals().update(matplotlib.cbook.boxplot_stats(a)[0])

In [7]: q3
Out[7]: 3300.0

In [8]: iqr
Out[8]: 3165.5

In [9]: q3+1.5*iqr
Out[9]: 8048.25

...so the largest element smaller than q3+1.5*iqr is... 1400!

The upper whisker would have to go DOWN from q3 (3300) to 1400 if the code didn't include that test.
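
For readers without matplotlib at hand, here is a minimal standalone sketch of the same check using only the standard library. The function name `upper_whisker` is hypothetical, and the "inclusive" quartile method is an assumption chosen because it reproduces the q3 and iqr values shown in the session above for this data; it is a re-implementation of the idea in the quoted snippet, not matplotlib's actual code.

```python
from statistics import quantiles


def upper_whisker(data, whis=1.5):
    """Return (q3, hi_val, raw_max, whisker) for the upper whisker.

    Hypothetical helper mirroring the logic of the quoted snippet:
    take the largest datum <= hi_val, but never let the whisker end
    below q3.
    """
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    hi_val = q3 + whis * (q3 - q1)

    # Data points at or below the upper limit.
    candidates = [x for x in data if x <= hi_val]
    raw_max = max(candidates) if candidates else None

    # The guard in question: clamp to q3 if the largest candidate
    # falls inside the box (or if there is no candidate at all).
    whisker = q3 if raw_max is None or raw_max < q3 else raw_max
    return q3, hi_val, raw_max, whisker


a = [3, 9000, 150, 88, 350, 200000, 1400, 960]
q3, hi_val, raw_max, whisker = upper_whisker(a)
print(q3, hi_val, raw_max, whisker)  # 3300.0 8048.25 1400 3300.0
```

Because q3 is interpolated between 1400 and 9000, no data point lies between q3 and hi_val, so without the guard the whisker would end at 1400, inside the box.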

ttsiodras