I am trying to understand a quirk with the subset() function in R and the use of the $ operator. I'll use the CO2 dataset in R as an example:

I can run

sub <- subset(CO2, CO2$Type=="Quebec")

without error to arrive at the same dataset as if I were to run

sub <- subset(CO2, Type=="Quebec")

However, I've observed that this is not always the case.

Sometimes including the $ within subset() function will produce the following error

$ operator is invalid for atomic vectors

What is triggering the '$ operator is invalid for atomic vectors' error? Why is the $ allowed in some instances (like the CO2 example above) but not in others? (I'm particularly frustrated when I bring in my own data through read.csv(): sometimes I get the error when trying to subset with $ and sometimes I do not, without any discernible pattern.)

Thanks!

Per comments below, I'm attempting to post reproducible examples.

Here is the situation that triggers the error:

    Moose<-structure(list(Moose = 1:25, Tagging_Loc = structure(c(1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    Gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
    2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
    2L), .Label = c("F", "M"), class = "factor"), Age = c(20L, 
    23L, 14L, 15L, 10L, 9L, 5L, 10L, 19L, 22L, 21L, 21L, 7L, 
    16L, 19L, 9L, 23L, 5L, 9L, 10L, 16L, 8L, 13L, 14L, 6L), Weight = c(1366L, 
    1006L, 888L, 1359L, 899L, 635L, 400L, 1000L, 1012L, 1480L, 
    1001L, 1100L, 482L, 1414L, 971L, 725L, 1400L, 416L, 790L, 
    970L, 921L, 560L, 1103L, 904L, 669L), Distance = c(250.5, 
    410.239, 457.6402591, 245.8523, 430.9975, 308.8673107, 212.5212497, 
    414.2093545, 439.6581, 215.6491489, 464.2384, 425.4256828, 
    233.5635555, 207.98, 453.7098751, 390.0506365, 235.5212497, 
    207.368, 427.5084899, 443.0452824, 459.8999274, 274.6856592, 
    350.5661674, 456.9600032, 330.146)), .Names = c("Moose", 
"Tagging_Loc", "Gender", "Age", "Weight", "Distance"), class = "data.frame", row.names = c(NA, 
-25L))

sub_Moose<-subset(Moose, Moose$Tagging_Loc=="A")

sub_Moose<-subset(Moose, Tagging_Loc=="A")

But if I only change the name of the dataset, both versions of subset() run fine - no error:

    mOose<-structure(list(Moose = 1:25, Tagging_Loc = structure(c(1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 1L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    Gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
    2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 
    2L), .Label = c("F", "M"), class = "factor"), Age = c(20L, 
    23L, 14L, 15L, 10L, 9L, 5L, 10L, 19L, 22L, 21L, 21L, 7L, 
    16L, 19L, 9L, 23L, 5L, 9L, 10L, 16L, 8L, 13L, 14L, 6L), Weight = c(1366L, 
    1006L, 888L, 1359L, 899L, 635L, 400L, 1000L, 1012L, 1480L, 
    1001L, 1100L, 482L, 1414L, 971L, 725L, 1400L, 416L, 790L, 
    970L, 921L, 560L, 1103L, 904L, 669L), Distance = c(250.5, 
    410.239, 457.6402591, 245.8523, 430.9975, 308.8673107, 212.5212497, 
    414.2093545, 439.6581, 215.6491489, 464.2384, 425.4256828, 
    233.5635555, 207.98, 453.7098751, 390.0506365, 235.5212497, 
    207.368, 427.5084899, 443.0452824, 459.8999274, 274.6856592, 
    350.5661674, 456.9600032, 330.146)), .Names = c("Moose", 
"Tagging_Loc", "Gender", "Age", "Weight", "Distance"), class = "data.frame", row.names = c(NA, 
-25L))

sub_Moose<-subset(mOose, mOose$Tagging_Loc=="A")

sub_Moose<-subset(mOose, Tagging_Loc=="A")
  • 1
    Could you please provide a reproducible example? – kaksat Feb 21 '17 at 21:27
  • 3
    The big reason you use `subset()` is so that you don't have to use `$`. You should **not** use `$` if the variable of interest is coming from the data.frame in the first parameter. You are likely getting an error when you have a column the same name as the data.frame. But it could happen for other reasons too. You really should provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) when asking for help – MrFlick Feb 21 '17 at 21:28
  • 2
    Also take care with using subset. As it says near the bottom of the help file, `?subset`, "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." – lmo Feb 21 '17 at 21:31
  • Apologies, this is my first post. CO2 is a dataset available in R. Why are my examples not considered reproducible? Also, I understand that `$` is not necessary with `subset()` but was trying to ascertain why the atomic vector error is triggered sometimes but not others. – Just Curious Feb 21 '17 at 21:51
  • 1
    @lmo I've never understood that warning. It's just a generic NSE warning. Same would apply to any `dplyr` function yet people use those all the time. Subset is fine if it's used the way it's meant to be used. There is no reason to fear it. – MrFlick Feb 21 '17 at 21:51
  • @JustCurious It's not reproducible because it doesn't reproduce the error that you supposedly get. The code you shared works just fine so it's hard to say what's going wrong in your particular case. – MrFlick Feb 21 '17 at 21:52
  • @MrFlick, I don't know how to share reproducible examples that trigger the error because sometimes the error is triggered and sometimes it's not, even on the exact same dataset. Part of my issue is that the behavior doesn't seem to be reproducible. I have a small Moose dataset I created in Excel. I bring it into R with `read.csv()`. If I name the dataset Moose and attempt to subset with the `$`, the error will appear. If I bring in the exact same data and name it mOose, everything runs fine. – Just Curious Feb 21 '17 at 22:02
  • @JustCurious, well the point is DONT USE $. Errors are rarely stochastic. If a problem isn't reproducible, it's not fixable. At best we can guess what might be wrong but that isn't helpful or productive in the long run. – MrFlick Feb 21 '17 at 22:04
  • @MrFlick, I've edited my original question to include what I hope are reproducible examples to illustrate my problem. Thanks again for all your insight. – Just Curious Feb 21 '17 at 22:17

1 Answer

Don't use $ with subset! Either use

sub <- subset(CO2, Type=="Quebec")

or use

sub <- CO2[CO2$Type=="Quebec", ]

The subset() function works by evaluating all symbols in the environment of the data.frame. In your Moose example, your data.frame Moose has a column named Moose. So when you run

sub_Moose <- subset(Moose, Moose$Tagging_Loc=="A")

the expression Moose$Tagging_Loc=="A" is evaluated in the environment of the data.frame. In that data.frame, there is a column named Moose, so the symbol Moose evaluates to that column vector before R finds the data.frame of the same name. Note that with() is a lot like subset() in that it evaluates the expression in the context of an environment or data.frame. Observe:

class(Moose)
# [1] "data.frame"
with(Moose, class(Moose))
# [1] "integer"
class(Moose$Moose)
# [1] "integer"

So Moose$Tagging_Loc=="A" will only work when Moose is a data.frame, but when you use subset(), Moose is an integer vector because it's finding the column first.
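
The same collision can be sketched on a much smaller data frame. The name Herd and its columns are made up for illustration and are not from the question; the only assumption is that the data frame and one of its columns share a name. tryCatch() is used so the error message is captured rather than stopping the script.

```r
# Hypothetical toy data: the data frame and one of its columns are both named Herd.
Herd <- data.frame(Herd = 1:4, Loc = c("A", "B", "A", "B"))

# Evaluated with the data frame's columns in scope, Herd is the integer column.
inner_class <- with(Herd, class(Herd))

# So Herd$Loc applies $ to an atomic vector; capture the error message.
err <- tryCatch(
  subset(Herd, Herd$Loc == "A"),
  error = function(e) conditionMessage(e)
)

# Both recommended forms sidestep the name collision.
by_name  <- subset(Herd, Loc == "A")   # bare column name inside subset()
by_index <- Herd[Herd$Loc == "A", ]    # $ outside subset(), standard indexing

print(inner_class)
print(err)
print(nrow(by_name))
print(nrow(by_index))
```

Renaming either the data frame or the column removes the collision, which is why the mOose version in the question runs without error.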

MrFlick