
I recently encountered some behaviour in ggplot2 that I cannot explain, and I would be grateful if someone could explain it to me.

I needed to plot a continuous variable on a histogram, and I needed to represent two categorical variables on the plot. The following dataframe is a good example.

library(ggplot2)


species <- rep(c('cat', 'dog'), 30)
numb <- rep(c(1,2,3,7,8,10), 10)
groups <- rep(c('A', 'A', 'B', 'B'), 15)

data <- data.frame(species=species, numb=numb, groups=groups)

Let the following code represent the categorisation of a continuous variable.

data$factnumb <- as.factor(data$numb)

To plot this dataset, the following two pieces of code appear to be completely interchangeable.

Note the difference in the fill= argument.

p <- ggplot(data, aes(x=factnumb, fill=species)) +
        facet_grid(groups ~ .) +
        geom_bar(aes(y=(..count..)/sum(..count..))) +
        scale_y_continuous(labels = scales::percent)

plot(p): [image not shown]

q <- ggplot(data, aes(x=factnumb, fill=data$species)) +
        facet_grid(groups ~ .) +
        geom_bar(aes(y=(..count..)/sum(..count..))) +
        scale_y_continuous(labels = scales::percent)

plot(q): [image not shown]

However, when working with real-life continuous variables, not all categories will contain observations, and I still need to show the empty categories on the x-axis to approximate the sample distribution. To demonstrate this, I used the following code:

data_miss  <- data[which(data$numb!= 3),]

This results in a disparity between the levels of the categorical variable and the observations in the dataset:

> unique(data_miss$factnumb)
[1] 1  2  7  8  10
Levels: 1 2 3 7 8 10
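The mismatch between levels and observed values can also be checked directly in base R. The following is a minimal sketch that rebuilds only the two relevant columns of the example data; it assumes nothing beyond the definitions given earlier in the question.

```r
# Rebuild the relevant part of the example data
numb <- rep(c(1, 2, 3, 7, 8, 10), 10)
data <- data.frame(numb = numb)
data$factnumb <- as.factor(data$numb)

# Drop every row where numb is 3, as in the question
data_miss <- data[data$numb != 3, ]

# Count observations per level; level "3" is still listed, with a count of 0
counts <- table(data_miss$factnumb)
print(counts)

# The factor keeps all six levels unless they are dropped explicitly
n_all <- nlevels(data_miss$factnumb)
n_used <- nlevels(droplevels(data_miss$factnumb))
```

Because the unused level is kept, scale_x_discrete(drop=FALSE) has an empty category to display.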

I then plotted the data_miss dataset, still including all of the levels of the factnumb variable.

pm <- ggplot(data_miss, aes(x=factnumb, fill=species)) +
        facet_grid(groups ~ .) +
        geom_bar(aes(y=(..count..)/sum(..count..))) +
        scale_fill_discrete(drop=FALSE) +
        scale_x_discrete(drop=FALSE)+
        scale_y_continuous(labels = scales::percent)

plot(pm): [image not shown]

qm <- ggplot(data_miss, aes(x=factnumb, fill=data_miss$species)) +
        facet_grid(groups ~ .) +
        geom_bar(aes(y=(..count..)/sum(..count..))) +
        scale_x_discrete(drop=FALSE)+
        scale_fill_discrete(drop=FALSE) +
        scale_y_continuous(labels = scales::percent)

plot(qm): [image not shown]

In this case, using fill=data_miss$species changes the fill of the plot (and for the worse).

I would be really happy if someone could clear this one up for me.

Is it just "luck" that the fill is identical in plots 1 and 2, or have I stumbled upon some delicate mistake in the fine machinery of ggplot2?

Thanks in advance!

Kind regards,

Bernadette

Powkachu

1 Answer


Using aes(data$variable) is never a good idea and is never recommended. Sometimes it still works, but aes(variable) always works, so you should always use aes(variable).

More explanation:

ggplot uses nonstandard evaluation. A function that uses standard evaluation can only find objects by name in the environment it is called from (here, the global environment); it does not look inside data frames. If I have data named mydata with a column named col1, and I do mean(col1), I get an error:

mydata = data.frame(col1 = 1:3)
mean(col1)
# Error in mean(col1) : object 'col1' not found

This error happens because col1 isn't in the global environment. It's just a column name of the mydata data frame.
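For comparison, here is a minimal base R sketch of two ways to reach the column under standard evaluation, using the same mydata as above. The with() call is included only to illustrate evaluating an expression with a data frame's columns visible; it is not how aes() is implemented.

```r
mydata <- data.frame(col1 = 1:3)

# Standard evaluation: name the data frame explicitly
explicit_mean <- mean(mydata$col1)

# with() evaluates the expression with the columns of mydata visible by name
with_mean <- with(mydata, mean(col1))

print(explicit_mean)
print(with_mean)
```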

The aes function does extra work behind the scenes, and knows to look at the columns of the layer's data, in addition to checking the global environment.

ggplot(mydata, aes(x = col1)) + geom_bar()
# no error

You don't have to use just a column inside aes, though. To give flexibility, you can use a function of a column, or even some other vector that you happen to define on the spot (if it has the right length):

# these work fine too
ggplot(mydata, aes(x = log(col1))) + geom_bar()
ggplot(mydata, aes(x = c(1, 8, 11))) + geom_bar()

So what's the difference between col1 and mydata$col1? Well, col1 is the name of a column, and mydata$col1 is the actual values: just a vector holding the full column. ggplot will look for a column named col1 in your data and use that.

The difference matters because ggplot often does data manipulation. Whenever there are facets or aggregate functions, ggplot is splitting your data up into pieces and doing stuff. To do this effectively, it needs to be able to identify the data and the column names. When you give it mydata$col1, you're not giving it a column name, you're just giving it a vector of values (whatever happens to be in that column) that is not tied to the data ggplot is splitting up, and things don't work.
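The idea can be illustrated with a small base R sketch. This is only an analogy for splitting data into pieces, not ggplot2's actual internals, and the names toy, grp, pieces, and external are hypothetical.

```r
# Hypothetical toy data: six rows in two groups
toy <- data.frame(col1 = 1:6, grp = rep(c("A", "B"), each = 3))

# Split the data frame into one piece per group, loosely like a faceting step
pieces <- split(toy, toy$grp)

# A column referred to by name is found inside each piece: 3 values per group
by_name <- vapply(pieces, function(piece) length(piece$col1), integer(1))

# An external vector ignores the split and always carries all 6 values
external <- toy$col1
by_vector <- vapply(pieces, function(piece) length(external), integer(1))

print(by_name)
print(by_vector)
```

Looking a column up by name gives values that belong to each piece, while the external vector stays the same full-length vector no matter how the data has been divided.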

So, just use unquoted column names in aes() without data$ and everything will work as expected.

Gregor Thomas