
My first ggplot2 box plot came out as one big stretched-out box. The second one was correct, but I don't understand what changed or why the second one worked. I'm new to R and ggplot2, so let me know if you can explain it. Thanks.

#----------------------------------------------------------
#    This is the original ggplot that didn't work:
#----------------------------------------------------------
zSepalFrame <- data.frame(zSepalLength, zSepalWdth)
zPetalFrame <- data.frame(zPetalLength, zPetalWdth)

p1 <- ggplot(data = zSepalFrame, mapping = aes(x=zSepalWdth, y=zSepalLength, group = 4)) +  #fill = zSepalLength
  geom_boxplot(notch=TRUE) +
  stat_boxplot(geom = 'errorbar', width = 0.2) +
  theme_classic() +
  labs(title = "Iris Data Box Plot") +
  labs(subtitle ="Z Values of Sepals From Iris.R")

p1
#----------------------------------------------------------
#    This is the new ggplot box plot line that worked:
#----------------------------------------------------------

bp = ggplot(zSepalFrame, aes(x=factor(zSepalWdth), y=zSepalLength, color = zSepalWdth)) + geom_boxplot() + theme(legend.position = "none")
bp

This is what the ggplot box plot looked like

MrFlick
cocoakrispies93
  • 2
    Why did you include `group = 4` in the `aes()` in the first one? That tells ggplot that all the values come from the same group (group #4, but you could have put any number there and the result would have been the same). – MrFlick Sep 04 '21 at 01:38
  • 1
    [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. Best we can do is guess until then, but you had a boxplot over a continuous variable (x-axis) when you generally want it to be grouped by a discrete variable – camille Sep 04 '21 at 01:38

1 Answer


I don't have your precise dataset, OP, but the problem seems to stem from assigning a continuous variable to your x axis, when a boxplot needs a discrete variable (or an explicit group) to split the data into separate boxes.

A continuous variable is something like a numeric column in a dataframe. So something like this:

x <- c(4,4,4,8,8,8,8)

Even though the variable x only contains 4s and 8s, R stores it as a numeric vector, which ggplot treats as continuous. It means that if you plot this on the x axis, ggplot will have no issue with a value falling anywhere between 4 and 8, and each value will be positioned accordingly.
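
If you want to check how R is treating a vector, you can inspect its class. This is only a minimal base R sketch using the same x as above:

```r
x <- c(4, 4, 4, 8, 8, 8, 8)

class(x)       # "numeric"
is.numeric(x)  # TRUE: ggplot2 would give this a continuous scale
is.factor(x)   # FALSE: there are no discrete levels yet
```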

The other type of variable is called discrete, which would be something like this:

y <- c("Green", "Green", "Flags", "Flags", "Cars")

The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".

The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize or force R to consider a variable as a factor. If you typed factor(x), you get this:

[1] 4 4 4 8 8 8 8
Levels: 4 8

The values in x are the same, but now there is no such thing as a number between 4 and 8 when x is a factor - it would just add another level.
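
To see what the factor actually stores, you can look at its levels. Here is a small base R sketch (the name f is just for illustration). One thing to watch for: as.integer() on a factor returns the level codes, not the original numbers, so go through as.character() if you need the values back.

```r
x <- c(4, 4, 4, 8, 8, 8, 8)
f <- factor(x)

levels(f)    # "4" "8": the distinct values, stored as labels
nlevels(f)   # 2

# Level codes, not the original values
as.integer(f)                # 1 1 1 2 2 2 2

# Recover the original numbers by converting the labels
as.numeric(as.character(f))  # 4 4 4 8 8 8 8
```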

That is in short why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_boxplot()


This looks similar to your first plot. The reason is that a boxplot is drawn by splitting the data into groups and then calculating statistics on those groups. By default, ggplot forms the groups from the discrete variables in the plot. If x is continuous and nothing else in the plot is discrete, every row lands in a single group, even if values are repeated (as in x above), so you get one big box. One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:

ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
  geom_boxplot()
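
If you want to confirm the grouping without looking at the picture, one option is to count the boxes ggplot2 computes. This is a rough sketch, assuming ggplot2 is installed; layer_data() returns the computed data for a layer, and for a boxplot layer that is one row per box. The names p_cont, p_fact, n_cont and n_fact are mine, not from the question.

```r
library(ggplot2)

# Continuous x: nothing discrete in the plot, so all rows form one group
p_cont <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_boxplot()

# Discrete x: one group per distinct Sepal.Length value
p_fact <- ggplot(iris, aes(x = factor(Sepal.Length), y = Sepal.Width)) +
  geom_boxplot()

# suppressWarnings() hides the warning ggplot2 may emit about a
# continuous x aesthetic without a group
n_cont <- nrow(suppressWarnings(layer_data(p_cont, 1)))
n_fact <- nrow(layer_data(p_fact, 1))

n_cont                                       # 1
n_fact == length(unique(iris$Sepal.Length))  # TRUE
```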


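The other way to get groups would be to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset. The difference from factor() is that the x axis stays continuous, so each box is drawn at its actual Sepal.Length position.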

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, group=Sepal.Length)) +
  geom_boxplot()
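
To compare the two approaches side by side, here is a hedged sketch, again assuming ggplot2 is installed, that looks at where each version places its boxes. The object names (p_group, p_fact, d_group, d_fact) are only illustrative.

```r
library(ggplot2)

# Grouped by Sepal.Length, but x is still a continuous scale
p_group <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                            group = Sepal.Length)) +
  geom_boxplot()

# x converted to a factor, so the scale is discrete
p_fact <- ggplot(iris, aes(x = factor(Sepal.Length), y = Sepal.Width)) +
  geom_boxplot()

d_group <- layer_data(p_group, 1)
d_fact  <- layer_data(p_fact, 1)

# Both versions compute the same number of boxes
nrow(d_group) == nrow(d_fact)

# With group=, boxes sit at the real Sepal.Length values
range(d_group$x)

# With factor(), boxes sit at evenly spaced positions 1, 2, 3, ...
range(as.numeric(d_fact$x))
```

Which one to use depends on whether the spacing between x values matters to you: group= keeps the numeric spacing, while factor() spaces every box evenly and labels it with its level.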
chemdork123