Plotting inside function: subset(df,id_==...) gives wrong plot, df[df$id_==...,] is right

Question

I have a df with multiple y-series which I want to plot individually, so I wrote a fn that selects one particular series, assigns to a local variable dat, then plots it. However ggplot/geom_step when called inside the fn doesn't treat it properly like a single series. I don't see how this can be a scoping issue, since if dat wasn't visible, surely ggplot would fail?

You can verify that the code plots correctly when executed at the top level, but not inside the function, where geom_step doesn't display it as a single series. This is not a duplicate question: I understand that this is a recurring issue with ggplot, but I've read all the other answers and they do not give the solution.

set.seed(1234)
require(ggplot2)
require(scales)

N = 10
df <- data.frame(x = 1:N,
                 id_ = c(rep(20,N), rep(25,N), rep(33,N)),
                 y = c(runif(N, 1.2e6, 2.9e6), runif(N, 5.8e5, 8.9e5) ,runif(N, 2.4e5, 3.3e5)),
                 row.names=NULL)

plot_series <- function(id_, envir=environment()) {
  dat <- subset(df,id_==id_)
  p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
  # Unsuccessfully trying the approach from http://stackoverflow.com/questions/22287498/scoping-of-variables-in-aes-inside-a-function-in-ggplot
  p$plot_env <- envir
  plot(p)
  # Displays wrongly whether we do the plot here inside fn, or return the object to parent environment 
  return(p)
}

# BAD: doesn't plot geom_step!
plot_series(20)

# GOOD! but what's causing the difference?
ggplot(data=subset(df,id_==20), mapping=aes(x,y), color='red') + geom_step()

#plot_series(25)
#plot_series(33)

This is not a duplicate question. The behavior is different. See references. — smci, May 05 '14 at 21:16
Don't use `subset`. Just take the subset using `[`. (And I think you don't even need to do any fancy stuff with the environments in that case, you can just return the plot.) — joran, May 05 '14 at 21:21
@joran, try actually running that, it fails *Error in `[.data.frame`(df, df$id_ == id_) : undefined columns selected* — smci, May 05 '14 at 21:23
From the help page for subset: `This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like ‘[’, and in particular the non-standard evaluation of argument ‘subset’ can have unanticipated consequences.` — Dason, May 05 '14 at 21:27
I added the missing comma `df[df$id_==id_,]`, that evaluates, but doesn't fix the ggplot issue. — smci, May 05 '14 at 21:34
@Dason, joran: However R thinks they are identical: `subset(...drop=T)` and `[..., drop=T]` !! `dat.sub <- subset(df,id_==20, drop=T) ; dat.ind <- df[df$id_==20,] ; identical(dat.ind, dat.sub) TRUE!!` Truly insane!! What is the difference? — smci, May 05 '14 at 21:35
@joran, please post this as a solution, but for the love of God please explain why two objects which the stupid @#$%ing language swears on its ancestors' graves are identical, are in fact not identical, and thus plot very differently :S — smci, May 05 '14 at 21:41
Put a `browser()` call inside your function and see if they're identical in that environment... you'll see that they're not. — Gregor Thomas, May 05 '14 at 21:47
@smci - Are you doing that at the top level in the interpreter? There is a difference between using subset at the top level and from within a function. Give it a try - return the data after you subset using subset within in the function and you'll see that you *don't* have the same thing. — Dason, May 05 '14 at 21:47
Ok dudes, I get you. ***subset() considered harmful due to its nonstandard evaluation (of its 'subset' arg)***. Below I linked the Hadley essay on this and related discussion on SO. — smci, May 05 '14 at 22:39

score 5 · Accepted Answer · answered May 05 '14 at 21:46

This works fine:

plot_series <- function(id_) {
    dat <- df[df$id_ == id_,]
    p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
    return(p)
}

print(plot_series(20))

If you simply step through the original function using debug, you'll quickly see that the subset line did not actually subset the data frame at all: it returned all rows!

Why? Because subset uses non-standard evaluation and you used the same name for both the column and the function argument. As jlhoward demonstrates in the other answer, simply using different names for the two would have worked (though it is probably not advisable).

The reason is that subset evaluates the condition with the data frame searched first, and only then looks in the calling environment. So both symbols in the logical expression id_ == id_ resolve to the column of that data frame, and the condition is always true.

One way to think about it is to play dumb (like a computer) and ask yourself when presented with the condition id_ == id_ how do you know what exactly each symbol refers to. It's ambiguous, and subset makes a consistent choice: use what's in the data frame.
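To make the lookup order concrete, here is a minimal sketch in base R. The toy data frame mirrors the shape of the question's df but with made-up y values, and the helper names rows_via_subset and rows_via_bracket are hypothetical, introduced only for this illustration. The final eval() call is an approximation of what subset does internally, not a copy of its source.

```r
# Toy data with the same shape as the question's df (values are illustrative).
df <- data.frame(x   = rep(1:10, 3),
                 id_ = rep(c(20, 25, 33), each = 10),
                 y   = seq_len(30))

rows_via_subset <- function(id_) {
  # Both `id_` symbols resolve to the column df$id_, so the test is all TRUE.
  nrow(subset(df, id_ == id_))
}

rows_via_bracket <- function(id_) {
  # df$id_ is the column; the bare id_ is the function argument.
  nrow(df[df$id_ == id_, ])
}

rows_via_subset(20)   # 30: every row is kept
rows_via_bracket(20)  # 10: only the requested series

# Roughly what subset() does: evaluate the captured expression with the
# data frame searched first and the calling environment second.
id_  <- 20
keep <- eval(quote(id_ == id_), df, environment())
all(keep)             # TRUE: the variable id_ = 20 is never consulted
```

With `[`, nothing is captured unevaluated, so each name is resolved by ordinary scoping rules and the argument is seen as intended.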

score 3 · Answer 2 · answered May 05 '14 at 21:43

Notwithstanding the comments, this works:

plot_series <- function(z, envir=environment()) {
  dat <- subset(df,id_==z)
  p <- ggplot(data=dat, mapping=aes(x,y), color='red') + geom_step()
  p$plot_env <- envir
  plot(p)
  # Displays wrongly whether we do the plot here inside fn, or return the object to parent environment 
  return(p)
}

plot_series(20)

The problem seems to be that subset interprets the id_ on the RHS of the == as the same column as the LHS, so this is equivalent to subsetting on TRUE, which of course includes all the rows of df. That's the plot you are seeing.
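The point can be checked without any plotting. The following is a small sketch, assuming a toy df with the same columns as in the question (the y values are placeholders) and a hypothetical helper select_series whose argument z is deliberately not a column name.

```r
# Toy data with the same columns as the question's df (y is a placeholder).
df <- data.frame(x   = rep(1:10, 3),
                 id_ = rep(c(20, 25, 33), each = 10),
                 y   = seq_len(30))

# Inside the data frame, both sides of `id_ == id_` are the same column.
cond <- with(df, id_ == id_)
all(cond)                       # TRUE
nrow(subset(df, id_ == id_))    # 30, i.e. nothing was filtered out

# Renaming the argument removes the clash, because z is not a column of df.
# Caveat: the same problem returns if df ever gains a column named z.
select_series <- function(z) subset(df, id_ == z)
nrow(select_series(20))         # 10
unique(select_series(25)$id_)   # 25
```

The rename works only as long as the argument name never collides with a column, which is why indexing with `[` is the more robust choice inside functions.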

jlhoward


Ah. Thanks! `subset` is one f#$%ed-up function. That's bad behavior by the function. – smci May 05 '14 at 21:46
@smci It's only bad behavior if you explicitly misuse the function, contrary to the warnings in the documentation! It's quite useful, and works great when used at the command line. – joran May 05 '14 at 21:47
a) Much published code uses subset. b) I've been using R three years and there is so much language errata, glitches, caveats and version-specific, I didn't notice the caveat on subset, and it's fairly well-hidden. Discussion and link to Hadley essay on the dangers of subset: http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset – smci May 05 '14 at 21:49
It's not bad behavior at all! Subset evaluates its arguments in the context of the dataframe in the first argument, so how else should `id_==id_` be interpreted? In fact, if subset interpreted the LHS as a column of `df` and the RHS as something else, *that would be bad behavior!!* – jlhoward May 05 '14 at 21:49
@jlhoward: because `ggplot(aes(x=x,y=y))` is perfectly well-defined and evaluates correctly. – smci May 05 '14 at 21:50
@smci Your comment is a non-sequitur: you are comparing a logical expression (in `subset(...)`) with specifying arguments to a function call (`aes(x=x)`). They are *completely different*. If I write `aes(x=(x==x))` then the first `x` specifies the `x` argument to `aes(...)`. The *expression* in parens would evaluate to `TRUE`. – jlhoward May 05 '14 at 21:58
No, they're related. The syntax lesson learned is that `df[df$id_ == id_,]` works because the LHS `id_` is a (string) arg passed into `'['`/`getElement()` whereas the RHS `id_` is a variable. By contrast, `subset(df,id_==id_)` fails because the logical expression can't disambiguate the two. – smci May 05 '14 at 22:34
In `df[df$id_ == id_,]` the LHS is `df$id_` and the RHS is `id_`, so obviously they are different. If you used `df[id_ == id_,]` you would have the same problem as in `subset(...)`. The *exact same thing* happens with data tables, which are more forgiving. `DT[id_==20]` will return all rows with column id=20. But, e.g., `id=20; DT[id==id]` returns *all the rows*. – jlhoward May 05 '14 at 22:39
Well it's not 'obvious' in R. We use formulas in `lm()` with deferred interpretation. We use `aes()` with deferred interpretation. Certainly, if the interface was `subset(df,"id_==id_")` it would be explicitly ambiguous who gets to evaluate both the LHS and RHS `id_`, how, where, when and in what scope and environment. Look how much confusion `aes()/aes_string()` alone causes about who parses what when! – smci May 05 '14 at 22:42
