I am trying to understand how matplotlib draws a box plot. This is the code I wrote:

import matplotlib.pyplot as plt
import pandas as pd

age=[20,22,22,23,23,23,23,24,24,24,24,26,26,30]
df=pd.DataFrame(age, columns=['age']) 
df['age'].describe()

This is the output:

count    14.000000
mean     23.857143
std       2.348720
min      20.000000
25%      23.000000
50%      23.500000
75%      24.000000
max      30.000000
Name: age, dtype: float64

I calculated the values of IQR, L and U:

IQR = Q3 - Q1 = 24 - 23 = 1
L = Q1 - 1.5 * IQR = 23 - 1.5 * 1 = 21.5
U = Q3 + 1.5 * IQR = 24 + 1.5 * 1 = 25.5

However, the graph generated by matplotlib is different from what I calculated:

df.boxplot(column = ['age']) 


In the plot, the lower and upper extremes (L and U) are 22 and 24, not 21.5 and 25.5.

What is the formula for L and U (lower and upper extreme) that matplotlib uses?

Thanks a lot for pointing out my mistakes.

Ch3steR
Arevik
    The following duplicate specifically answers this question [What exactly do the whiskers in pandas' boxplots specify?](https://stackoverflow.com/questions/12082568/what-exactly-do-the-whiskers-in-pandas-boxplots-specify). From that answer, **the upper whisker will extend to last datum less than Q3 + whis*IQR** – Trenton McKinney May 31 '20 at 16:45

1 Answer

The values you calculated are the bounds, not L and U.

L and U are the minimum and maximum points in the data that lie within 1.5 * IQR of Q1 and Q3, respectively.

You calculated 21.5 as the lower bound, but the smallest point in your data that is greater than or equal to that bound is 22.

Similarly, the upper bound you calculated is 25.5, but the largest point in your data that is less than or equal to that bound is 24.
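
To make the difference between the bounds and the whisker ends concrete, here is a minimal sketch of the rule described above. It is not matplotlib's actual code. The helper name `whisker_ends` is made up for illustration, and it assumes that `statistics.quantiles` with `method="inclusive"` produces the same quartiles as the `describe()` output in the question (it does for this data: 23 and 24).

```python
from statistics import quantiles

age = [20, 22, 22, 23, 23, 23, 23, 24, 24, 24, 24, 26, 26, 30]

def whisker_ends(data, whis=1.5):
    """Hypothetical helper: return (lower bound, upper bound,
    lower whisker, upper whisker) for a box plot of `data`."""
    # Quartiles with linear interpolation between data points
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # The bounds are computed values and need not be data points
    lower_bound = q1 - whis * iqr
    upper_bound = q3 + whis * iqr
    # The whiskers end at the most extreme data points inside the bounds
    inside = [x for x in data if lower_bound <= x <= upper_bound]
    return lower_bound, upper_bound, min(inside), max(inside)

lower_bound, upper_bound, lower_whisker, upper_whisker = whisker_ends(age)
print(lower_bound, upper_bound)      # 21.5 25.5
print(lower_whisker, upper_whisker)  # 22 24
```

The points that fall outside the bounds (20, 26, 26 and 30 here) are the ones drawn as individual outlier markers.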

Aziz