I don't have your precise dataset, OP, but the problem seems to stem from assigning a continuous variable to your x axis, when boxplots require a discrete variable there.
A continuous variable is something like a numeric column in a dataframe. So something like this:
x <- c(4,4,4,8,8,8,8)
Even though the variable x only contains 4's and 8's, R stores it as a numeric variable, which is continuous. This means that if you plot it on the x axis, ggplot has no issue with a value falling anywhere between 4 and 8, and will position it accordingly.
The other type of variable is called discrete, which would be something like this:
y <- c("Green", "Green", "Flags", "Flags", "Cars")
The variable y contains only characters. It must be discrete, since there is no such thing as something between "Green" and "Cars". If plotted on an x axis, ggplot will group things as either being "Green", "Flags", or "Cars".
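If you are unsure which kind of variable you have, a quick check in base R along these lines may help. This is only a sketch using the same x and y vectors as above; nothing here is specific to your data.

```r
# Same example vectors as in the answer
x <- c(4, 4, 4, 8, 8, 8, 8)
y <- c("Green", "Green", "Flags", "Flags", "Cars")

class(x)        # "numeric": ggplot treats this as continuous
class(y)        # "character": ggplot treats this as discrete
is.numeric(x)   # TRUE
is.numeric(y)   # FALSE
```

Running class() on the column you map to x in your own plot should tell you which case you are in.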
The cool thing is that you can change a continuous variable into a discrete one. One way to do that is to factorize it, that is, force R to treat the variable as a factor. If you type factor(x), you get this:
[1] 4 4 4 8 8 8 8
Levels: 4 8
The values in x are the same, but once x is a factor there is no such thing as a number between 4 and 8. A new value such as 6 would not fall "in between"; it would just become another level.
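To see the levels idea concretely, here is a small base R sketch. The names fx and fx2 are just illustrative, and the extra value 6 is made up for the demonstration.

```r
x <- c(4, 4, 4, 8, 8, 8, 8)
fx <- factor(x)

levels(fx)       # "4" "8"
nlevels(fx)      # 2
is.numeric(fx)   # FALSE: a factor is no longer treated as a number

# Adding a value "between" 4 and 8 just creates a third level
fx2 <- factor(c(x, 6))
levels(fx2)      # "4" "6" "8"
```

Each level becomes its own group, which is exactly what a boxplot needs on the x axis.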
That is, in short, why your box plot changes. Let's demonstrate with the iris dataset. First, an example like yours. Notice that I'm assigning x=Sepal.Length. In the iris dataset, Sepal.Length is numeric, so continuous.
library(ggplot2)
# x is continuous here, so there is nothing to group by
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_boxplot()

This is similar to yours. The reason is that a boxplot is drawn by grouping the data according to x and then calculating statistics on each group. If the x variable is continuous, there are no "groups", even when values repeat (as in x above), so ggplot draws a single box for all of the data. One way to make groups is to force the data to be discrete, as in factor(Sepal.Length). Here's what it looks like when you do that:
ggplot(iris, aes(x=factor(Sepal.Length), y=Sepal.Width)) +
  geom_boxplot()

The other way to get the same effect is to use the group= aesthetic, which does what you might think: it groups according to that column in the dataset.
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, group=Sepal.Length)) +
  geom_boxplot()
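If you want to confirm that both approaches produce one box per distinct x value, something like the following should work. It is a rough check that assumes ggplot2 is installed; it uses layer_data() to pull out the statistics ggplot computed for each box. The names p_factor, p_group, and the n_* variables are only illustrative.

```r
library(ggplot2)

# Approach 1: make x discrete with factor()
p_factor <- ggplot(iris, aes(x = factor(Sepal.Length), y = Sepal.Width)) +
  geom_boxplot()

# Approach 2: keep x continuous, but group explicitly
p_group <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                            group = Sepal.Length)) +
  geom_boxplot()

# layer_data() returns one row of box statistics per group
n_expected <- length(unique(iris$Sepal.Length))
n_factor   <- nrow(layer_data(p_factor))
n_group    <- nrow(layer_data(p_group))

c(expected = n_expected, factor = n_factor, group = n_group)
```

One difference to keep in mind: with group= the x axis stays continuous, so the boxes sit at their actual numeric positions, whereas factor() spaces the levels evenly along the axis.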