I'm early in the process of learning R. Say I have a data frame with a column named "Gender". If I want to retrieve all rows where Gender is "female" there are at least two ways I can do this:

FemaleSmokers <- df[df$Gender=="female", , drop = FALSE]
FemaleSmokers <- subset(df, Gender=="female")

1) Is there a best practice for when to use one over the other? 2) In the first approach, why do I need to prefix the column with the name of the data frame, when R should already know which data frame I am working with?

Zheyuan Li
Randy Minder
  • Welcome to R. Maybe this answer gives you some insight into your question: http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset – user5249203 Jun 27 '16 at 21:19
  • On the help page for [`?subset`](http://www.inside-r.org/r-doc/base/subset), you'll see this under the "Warning" section: "This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences." – Jota Jun 27 '16 at 21:22
  • @user5249203 - Your link helps with my first question, but not the second. – Randy Minder Jun 27 '16 at 21:24
  • Other alternative would be dplyr: `FemaleSmokers <- df %>% dplyr::filter(Gender == "female")` – Alejandro Alcalde Jun 27 '16 at 21:28
  • @user5249203 - That doesn't really answer my question. Yes, I know I'm indexing the column. But why do I need to specify the name of the data frame when R should already know it, since I'm indexing against it? – Randy Minder Jun 27 '16 at 21:29
  • @RandyMinder, I think the reason is that you can index by anything you want, not necessarily something from `df`. You could do, for example (a dummy one), `df[1==1]`. – Alejandro Alcalde Jun 27 '16 at 21:33
  • You need to specify the data.frame because you don't get any special evaluation within `[]`; it just takes index values, Booleans, or row/column names as strings. It doesn't operate on string names until everything is reduced to a single value, which it _then_ uses as a lookup value. That's actually good, as it lets you easily operate on those values. – alistaire Jun 27 '16 at 22:19
  • This site shows several ways to subset a data frame: https://www.r-bloggers.com/5-ways-to-subset-a-data-frame-in-r/ – ageans Jul 18 '19 at 15:15

2 Answers

I hope this worked example helps.

df<-data.frame( Name = c("mark", "joe", "cathy","zoya"), 
               Gender = c("Male","Male","Female", "Female"))
  Name Gender
1  mark   Male
2   joe   Male
3 cathy Female
4  zoya Female

Subsetting a data frame (df) is done with
df[row,column] 
For example, df[1:2,1:2]
 Name Gender
1 mark   Male
2  joe   Male

In your case, we are evaluating a condition on the dataframe
# both are valid
df[df$Gender == "Female",] or  df[df[,2] == "Female",] 

which is nothing but indexing the df as

df[c(3,4),] or df[c(FALSE,FALSE,TRUE,TRUE),]
df$Gender == "Female"
[1] FALSE FALSE  TRUE  TRUE

df[c(3,4),] selects rows 3 and 4 and all columns. So you are extracting a variable from the data frame and passing the result of the comparison as the row index. To extract a specific column from a data frame we use $ on the data frame; see help("$") and help("[").
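A minimal sketch of the equivalence described above, using the same df as in this answer. The names idx and rows are only for illustration.

```r
df <- data.frame(Name = c("mark", "joe", "cathy", "zoya"),
                 Gender = c("Male", "Male", "Female", "Female"))

# The comparison is evaluated first and yields a plain logical vector,
# one element per row of df
idx <- df$Gender == "Female"

# which() converts the logical vector into integer row positions
rows <- which(idx)

# All three forms select the same rows and keep every column
by_condition <- df[df$Gender == "Female", ]
by_logical   <- df[idx, ]
by_position  <- df[rows, ]

print(idx)
print(rows)
print(by_condition)
```

By the time `[` runs, it only sees a logical or integer vector; it has no knowledge that the vector came from a column of df.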

One more useful resource: http://www.ats.ucla.edu/stat/r/modules/subsetting.htm

Thinking again about your question of why you must prefix the column with df when R should already know which data frame you are working with: I don't have a better explanation than the one above. You need to extract the variable yourself so that you can pass the row indexes where your condition evaluated to TRUE. Inside [ ], the columns of a data frame are apparently not treated as variables.
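A small sketch of why the prefix matters, assuming the same df as above. The standalone vector Gender is hypothetical and exists only to show where a bare name gets looked up.

```r
df <- data.frame(Name = c("mark", "joe", "cathy", "zoya"),
                 Gender = c("Male", "Male", "Female", "Female"))

# Hypothetical unrelated vector that happens to share the column's name
Gender <- c("Female", "Male", "Male", "Male")

# Inside [ ] a bare name is resolved in the calling environment,
# so this uses the vector above, not the column df$Gender
from_global <- df[Gender == "Female", ]

# Prefixing with df$ makes the lookup explicit
from_column <- df[df$Gender == "Female", ]

# with() evaluates the expression with df's columns visible as variables
from_with <- with(df, df[Gender == "Female", ])

print(from_global$Name)
print(from_column$Name)
print(from_with$Name)
```

Without the standalone Gender vector, the first form would stop with an "object 'Gender' not found" error, which is the usual symptom of leaving off the df$ prefix.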

But there is good news: there is a package where things work the way you expect and columns are referred to as variables. It is data.table. Because columns are treated as variables, the syntax for indexing, joining and other data manipulation is easier to read. It is an amazing package and easy to master.

require(data.table)
DT<-data.table(df)
 Name Gender
1:  mark   Male
2:   joe   Male
3: cathy Female
4:  zoya Female

DT[Gender == "Female"]
    Name Gender
1: cathy Female
2:  zoya Female

Yes, you don't need to repeat the data frame name; you just pass the columns. Best of all, it is more efficient, faster and easier to use than data.frame. I hope this helps.

user5249203

How about using the filter function from the dplyr package?

FemaleSmokers <- dplyr::filter(df, Gender == "female")
UseR10085