Why does ggplot2 see a data.frame and data_frame differently?

Question

I have two very similar data frames that ggplot2 treats differently: the contents are the same, but the data structures are subtly different. One is a data.frame, the other a data_frame. I'd like to understand why ggplot2 treats them differently. In the following examples, both are used inside a stat_function; the data.frame produces plots while the data_frame produces errors. This is particularly confusing given the interoperability of packages in the Hadleyverse. I first ran into the issue when I could not create a plot from a data frame produced by dplyr (dplyr turns data.frames into data_frames), while a data frame I thought was identical (it wasn't; it was a data.frame) worked just fine.

Example 1

First, the working version from the data.frame.

library(ggplot2)
library(dplyr)

d.f <- data.frame(mean = 0, sd = 1)
d_f <- data_frame(mean = 0, sd = 1)

ggplot(data.frame(x=-3:3), aes(x)) +
  stat_function(fun = function (x) dnorm(x, mean = d.f[1,1], sd = d.f[1,2]))

And now the non-working version from the data_frame.

ggplot(data.frame(x=-3:3), aes(x)) +
  stat_function(fun = function (x) dnorm(x, mean = d_f[1,1], sd = d_f[1,2]))
## Warning message:
## Computation failed in `stat_function()`:
## Non-numeric argument to mathematical function

Example 2

This example produces a different error message though perhaps the underlying issue is the same. First, the working version with a data.frame.

logistic <- function (x) { 1/(1 + exp(-x)) }

d.f <- data.frame(b0 = -9, b1 = 0.8) 
d_f <- data_frame(b0 = -9, b1 = 0.8) 

ggplot(data.frame(x=0:20), aes(x)) +
  stat_function(fun = function (x) logistic(d.f[1,1] + d.f[1,2] * x))

And here's the non-working version with a data_frame.

ggplot(data.frame(x=0:20), aes(x)) +
  stat_function(fun = function (x) logistic(d_f[1,1] + d_f[1,2] * x))
## Error in eval(expr, envir, enclos) : object 'y' not found

Try `pull(d_f[1,2])`. It's still a tibble after subsetting, but ggplot is expecting a vector, which `pull` provides. Have a look [here](https://stackoverflow.com/questions/21618423/extract-a-dplyr-tbl-column-as-a-vector) — Roman, Oct 10 '17 at 15:15
It's not ggplot. The truth is that `data_frame`s are **not** `data.frame`s in some important respects, and you've discovered one of them. Hadley decided that he didn't like some of the default behavior of `data.frame`s and so he intentionally made `data_frame`s behave differently. User beware. — joran, Oct 10 '17 at 15:26
...you can create the same error with `data.frame`s by doing `d.f[1,1,drop = FALSE]`, I think. — joran, Oct 10 '17 at 15:27
As other comments alluded to, see the different output from `d.f[1, 1]` and `d_f[1, 1]`. — aosmith, Oct 10 '17 at 15:34
`data_frames` are `data.frames`; they just don't return the same object when subsetted. With `data_frame` you need to be explicit if you want to change the class. Subsetting with `[1,1]` gives a `data_frame` (and thus a `data.frame` as well) in one case, and a numeric in the other. — moodymudskipper, Oct 10 '17 at 15:37

score 3 · Accepted Answer · answered Oct 10 '17 at 16:22

ggplot was seeing a data frame where it expected a value.

This resulted from differences between the data types returned by the square-bracket subsetting operator when applied to a data.frame versus a tibble (the data frame preferred by Hadley's dplyr). Subsetting a data.frame can change types by default, e.g. returning a vector or a single value. Subsetting a tibble returns a tibble unless the user explicitly requests re-casting, e.g. by using pull or double brackets [[]]. The error message "Non-numeric argument to mathematical function" should have been a clue.

The following code demonstrates this by appropriately re-casting the tibbles.

library(ggplot2)
library(dplyr)

d.f <- data.frame(mean = 0, sd = 1)
d_f <- data_frame(mean = 0, sd = 1)

Subsetting a tibble (aka tbl_df) returns a tbl_df.

class(d_f[1,1])
## [1] "tbl_df"     "tbl"        "data.frame"

This can be re-cast with double square brackets [[]] or pull.

class(d_f[[1,1]])
## [1] "numeric"
class(pull(d_f[1,1]))
## [1] "numeric"

Subsetting a data.frame returns a numeric vector.

class(d.f[1,1])
## [1] "numeric"

The tibble behavior, i.e. no re-casting, can be reproduced on a data.frame with the argument drop=FALSE.

class(d.f[1,1, drop=FALSE])
## [1] "data.frame"
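As a side check that ggplot2 itself is not the culprit, the failure can be triggered with base R alone. The sketch below is not part of the original post: it keeps a 1x1 data.frame by passing drop = FALSE and hands it straight to dnorm, with the names cell, msg and val chosen only for illustration. The exact wording of the captured message may vary with the R locale.

```r
d.f <- data.frame(mean = 0, sd = 1)

# drop = FALSE keeps the single cell as a 1x1 data.frame instead of a number
cell <- d.f[1, 1, drop = FALSE]
class(cell)

# dnorm() refuses a data.frame argument; capture the message instead of stopping
msg <- tryCatch(
  dnorm(0, mean = cell, sd = 1),
  error = function(e) conditionMessage(e)
)
msg

# The default subsetting returns a bare number, which dnorm() accepts
val <- dnorm(0, mean = d.f[1, 1], sd = 1)
val
```

If msg comes back as an error message rather than a density, the problem lies in what is passed to dnorm, not in stat_function.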

Finally, showing that resolving the type issue resolves the plotting issue ...

ggplot(data.frame(x=-3:3), aes(x)) +
  stat_function(fun = function (x) dnorm(x, mean = pull(d_f[1,1]), sd = pull(d_f[1,2])))

and

ggplot(data.frame(x=-3:3), aes(x)) +
  stat_function(fun = function (x) dnorm(x, mean = d_f[[1,1]], sd = d_f[[1,2]]))

both produce the expected plot.
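The same fix should carry over to Example 2 from the question. The following is a minimal sketch rather than something tested in the original post: it pulls the two coefficients out as plain numbers once, before plotting, so the function passed to stat_function never touches the tibble. The names b0, b1 and p are illustrative, and tibble() is used in place of data_frame(), which has since been deprecated.

```r
library(ggplot2)
library(dplyr)

logistic <- function(x) 1 / (1 + exp(-x))

# tibble() stands in for the deprecated data_frame(); the result is the same class
d_f <- tibble(b0 = -9, b1 = 0.8)

# Extract the coefficients as plain numeric values before building the plot
b0 <- d_f[[1, 1]]
b1 <- d_f[[1, 2]]

p <- ggplot(data.frame(x = 0:20), aes(x)) +
  stat_function(fun = function(x) logistic(b0 + b1 * x))

p
```

Extracting the values up front also makes the type problem visible earlier: class(b0) can be checked before any plotting code runs.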
