
So I'm trying to do a very simple boxplot of one continuous variable against one discrete yes/no variable, and for reasons I totally don't understand, I can't get the range bars to display for the "NO" values.

Here's a simplified dataset ... save in your working directory as "femplot.csv"

SEQN,LBXVBZ,smoke
73614,0.206,YES
73616,0.017,NO
73739,0.017,NO
73751,0.135,YES
73763,0.237,YES
73766,0.017,NO
73805,0.19,YES
73848,0.017,NO
73914,0.198,YES
73924,0.017,NO
73938,0.161,YES
73975,0.167,YES
74006,0.031,YES
74007,0.017,NO
74008,0.017,NO
74022,0.147,YES
74046,0.017,NO
74054,0.017,NO
74091,0.156,YES
74101,0.179,YES
74141,0.106,NO
74150,0.115,YES
74154,0.017,NO
74160,0.017,NO
74173,0.035,NO
74180,0.017,NO
74195,0.017,NO
74211,0.017,NO
74221,0.078,YES

Now here's my code ... I'm trying this both using the R-native boxplot function and ggplot, with the same result:

library(ggplot2)

femplot <- read.csv("femplot.csv")

boxplot(LBXVBZ~smoke, data = femplot)

ggplot(data = femplot, aes(x=smoke, y=LBXVBZ))+
  stat_boxplot(geom="errorbar", width=0.5, coef = 10)+
  geom_boxplot()+
  stat_summary(fun = "mean", shape=23, color="red")+
  labs(x="Smoker", y="Benzene",
       title = "Distribution of blood benzene levels among smokers/nonsmokers")

Here's the output. Note that there's a box and range lines on the "YES" values, but none for the "NO". In point of fact the box is vanishingly small for "NO", so that's just fine, but I should still get range lines, since there are dots to show a range. I haven't bothered to include the image for the standard boxplot output, but it's equivalent.

[image: Sample output]

DanM
  • All of your NO values look like they are 0.017. If the first and third quartile of your data are the same, the "box" part of the box plot gets squished to a line. This is just what your data looks like. Perhaps you should consider some other visualization if that's not what you want. – MrFlick Oct 23 '20 at 21:52
  • Almost all, but not all. There's one value of 0.035, and an outlier of 0.106. The fact that the rest are the same is what depresses the average, and so you're right, the box is squashed to nothing. But the range bars should include those two outliers (obviously, there are more values in the live data). – DanM Oct 23 '20 at 21:56
  • The range bars only go out to the most extreme value that is no more than 1.5 times the interquartile range beyond the box. Here your interquartile range is 0, so you won't get range bars. Anything outside that region is just drawn as an individual outlier point. That's just how the traditional box and whisker plot works. – MrFlick Oct 23 '20 at 22:14
  • Ah, I see you tried to use `coef = 10` but if your IQR is 0, it doesn't really matter what coef you use. You'll just get 0. – MrFlick Oct 23 '20 at 22:16
  • Ah, I went back to my main data and sure enough that's the case. Looks like I'll have to do the full ranges manually. Will experiment and post back code. – DanM Oct 23 '20 at 22:37
  • Actually, is there a way to get stat_boxplot to do full ranges instead of IQR or multiple? – DanM Oct 23 '20 at 22:47

2 Answers


I'm mainly going to be repeating what was said in the comments, but this way it's in an answer.

Your 'NO's are almost all 0.017. At least, enough of them are that the few that aren't get flagged as outliers. This happens because the median and both quartiles are 0.017. That makes your IQR 0, and since the range lines extend only to the most extreme point within 1.5*IQR of the box, there won't be any. Therefore, your plots are correct. Just to display everything:

library(ggplot2)

data <- read.csv("~/Desktop/boxplot stack.csv")

ggplot(data, aes(x = smoke, y = LBXVBZ))+
  geom_boxplot()+
  labs(x="Smoker", y="Benzene")+
  ggtitle("Distribution of blood benzene levels among smokers/nonsmokers")+
  theme_bw()


This is exactly the same as you sent, but I just wanted to put everything down.
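To check the numbers behind that explanation, here is a minimal sketch using base R's boxplot.stats(), which computes the same five values that boxplot() draws. The vector `no` is the "NO" group from the question's sample data, retyped by hand with the repeated values grouped together, so treat it as an assumption rather than the real file.

```r
# "NO"-group benzene values from the question's sample data
# (14 readings of 0.017, plus 0.035 and 0.106), retyped by hand
no <- c(rep(0.017, 14), 0.035, 0.106)

# boxplot.stats() returns lower whisker, lower hinge, median,
# upper hinge and upper whisker in $stats, and outliers in $out
default_stats <- boxplot.stats(no)             # default coef = 1.5
wide_stats    <- boxplot.stats(no, coef = 10)  # the coef tried in the question

# Distance between the hinges, i.e. the height of the box
box_height <- diff(default_stats$stats[c(2, 4)])

print(default_stats$stats)   # all five values coincide
print(box_height)            # zero, so coef * box_height is zero as well
print(default_stats$out)     # the two points left outside the whiskers
print(identical(default_stats$stats, wide_stats$stats))
```

Because the box height is zero, multiplying it by 1.5 or by 10 gives the same whisker limits, which is why changing coef has no visible effect here.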

Érico Patto

So with @MrFlick's point that I was using data with an IQR of zero, I couldn't get error bars the standard way. Following the guide at this SO post, I've revised the code as follows:

library(ggplot2)

femplot <- read.csv("femplot.csv")

#Special functions to allow for error bars to plot min/max due to compressed "NO" data
o <- function(x) {
  subset(x, x == max(x) | x == min(x))
}

f <- function(x) {
  r <- quantile(x, probs = c(0.00, 0.25, 0.5, 0.75, 1))
  names(r) <- c("ymin", "lower", "middle", "upper", "ymax")
  r
}

boxplot(LBXVBZ~smoke, data = femplot)

ggplot(data = femplot, aes(x=smoke, y=LBXVBZ))+
  stat_summary(fun.data=f, geom="boxplot") + 
  stat_summary(fun = o, geom="point") +
  stat_boxplot(geom='errorbar',width=0.5,coef=10)+
  stat_summary(fun = "mean", shape=23, color="red")+
  labs(x="Smoker", y="Benzene",  # the CSV carries no "label" attribute, so name the axis directly
       title = "Distribution of blood benzene levels among smokers/nonsmokers")

This mostly works. Unfortunately the errorbar still draws its end bars only on the "YES" box and not on the "NO" one, so something is still wrong, and I would be indebted to anyone who can help me figure out why. At least this now gives me full-range bars. Here's the output:

[image: Mostly-corrected output]
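One likely reason the end bars are still missing on "NO": the errorbar layer is still computed by stat_boxplot(), which derives its limits from coef times the IQR, and the IQR is zero there. A possible workaround, sketched below, is to feed the errorbar from stat_summary() with a min/max function instead. The helper names `full_range` and `box_stats` are hypothetical, and the data frame is the question's sample data retyped inline (grouped by smoke, SEQN dropped) so the sketch runs without the CSV.

```r
library(ggplot2)

# Sample data from the question, retyped inline: 16 "NO" rows, 13 "YES" rows
femplot <- data.frame(
  smoke  = rep(c("NO", "YES"), times = c(16, 13)),
  LBXVBZ = c(rep(0.017, 14), 0.035, 0.106,
             0.206, 0.135, 0.237, 0.190, 0.198, 0.161, 0.167,
             0.031, 0.147, 0.156, 0.179, 0.115, 0.078)
)

# Hypothetical helper: whisker ends at the group minimum and maximum
full_range <- function(x) {
  data.frame(ymin = min(x), ymax = max(x))
}

# Hypothetical helper: box from the quartiles, whiskers at min and max
box_stats <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  data.frame(ymin = q[1], lower = q[2], middle = q[3], upper = q[4], ymax = q[5])
}

p <- ggplot(femplot, aes(x = smoke, y = LBXVBZ)) +
  # Errorbar limits come from min/max, not from coef * IQR
  stat_summary(fun.data = full_range, geom = "errorbar", width = 0.5) +
  stat_summary(fun.data = box_stats, geom = "boxplot") +
  # Group mean as a red diamond
  stat_summary(fun = mean, geom = "point", shape = 23, color = "red") +
  labs(x = "Smoker", y = "Benzene",
       title = "Distribution of blood benzene levels among smokers/nonsmokers")

print(p)
```

With this arrangement both groups get end caps at their extremes, so the "NO" bar runs from 0.017 up to 0.106 even though its box is collapsed to a line.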

DanM