While reading the answers to a related question on Stack Overflow, I saw the code matplotlib uses to compute the whisker positions and detect outliers:
    # get high extreme
    iq = q3 - q1
    hi_val = q3 + whis * iq
    wisk_hi = np.compress(d <= hi_val, d)
    if len(wisk_hi) == 0 or np.max(wisk_hi) < q3:
        wisk_hi = q3
    else:
        wisk_hi = max(wisk_hi)
Now, the `else` part makes perfect sense: as per the specification of Tukey boxplots, we find the highest datum within 1.5 IQR of the upper quartile. Indeed, that is `max(wisk_hi)`, the largest data entry that is at or below `Q3 + 1.5*IQR`.
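To make that concrete, here is a minimal sketch that wraps the snippet in a function so the `else` branch can be run on a small dataset. The name `upper_whisker` is mine, and computing the quartiles with `np.percentile` is an assumption, since the quoted code doesn't show where `q1` and `q3` come from.

```python
import numpy as np

def upper_whisker(d, whis=1.5):
    """Hypothetical wrapper around the quoted snippet.

    Returns the upper whisker position and which branch produced it.
    """
    d = np.asarray(d, dtype=float)
    # Assumption: quartiles come from np.percentile; the quoted
    # snippet does not show how q1 and q3 are computed.
    q1, q3 = np.percentile(d, [25, 75])
    iq = q3 - q1
    hi_val = q3 + whis * iq
    # Keep only the data at or below the upper fence.
    wisk_hi = np.compress(d <= hi_val, d)
    if len(wisk_hi) == 0 or np.max(wisk_hi) < q3:
        return q3, "if"
    return np.max(wisk_hi), "else"

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
whisker, branch = upper_whisker(data)
# Here q1 = 3.25, q3 = 7.75, so hi_val = 14.5: the whisker ends at 9
# via the else branch, and 30 lies above the fence.
print(whisker, branch)
```

With this data the whisker lands on the largest datum under the fence, which matches my reading of the `else` part.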
The `or` part, however, I don't understand. The `if len(wisk_hi) == 0` condition translates to "if we find no elements at or below `hi_val`".
How can this condition apply? Q3 is found by splitting the data at the median and taking the median of the upper half, and `hi_val` then adds 1.5*IQR on top of that. How can there NOT be data at or below this value?
If this is about an empty dataset, then the second part of the `or` doesn't make sense either (since neither Q3 nor the IQR is defined without data).
I'm probably missing something obvious. Help?