R selecting rows from dataframe using logical indexing: accessing columns by `$` vs `[]`

Question

I have a simple R data.frame object df. I am trying to select rows from this data frame using a logical index built from a column col in df.

I am coming from the Python world, where for similar operations I can select with either df[df[col] == 1] or df[df.col == 1] and get the same result.

However, for the R data frame, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed with the summary command). I do not understand this difference, because links like http://adv-r.had.co.nz/Subsetting.html suggest that either way is fine. Also, the str command shows the same output for df$col and df[, col].

Are there any guidelines about when to use the $ vs [] operator?

Edit: digging a little deeper and using this question as reference, it seems that the following code works correctly:

df[which(df$col == 1), ]

However, it is not clear to me how to guard against NA and when to use which.

Based on your examples I sort of wonder if you might be a little confused about the distinction between `[` and `[[` for lists (which includes data frames)? Because using single and double braces has different results. (See the top related question linked over at the right.) — joran, Jun 26 '17 at 17:34
If you are interested in using the data.table package, you can subset rows based on a logical condition very easily. See https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html — be_green, Jun 26 '17 at 17:41

score 3 · Answer 1 · answered Jun 26 '17 at 17:47

You are mixing up several things.

In

df[,col]

col is usually the column number, although a character string holding the column name works too. For example,

col = 2
x = df[,col]

would select the second column and store it in x.

In

df$col

col is taken literally as the column name; it is not evaluated as a variable. For example,

df=data.frame(aa=1:5,bb=10:14)
x = df$bb

would select the second column and store it in x. But you cannot write df$2.
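To make the difference concrete, here is a minimal sketch. The variable col and the comparison names (by_dollar, by_bracket, by_double) are made up for illustration; the data frame is the one from the example above.

```r
# Example data from the answer above
df <- data.frame(aa = 1:5, bb = 10:14)

# Hypothetical variable that holds a column name
col <- "bb"

by_dollar  <- df$col      # looks for a column literally named "col"
by_bracket <- df[, col]   # uses the value stored in col, i.e. "bb"
by_double  <- df[[col]]   # same lookup, list style

is.null(by_dollar)             # TRUE, because df has no column called "col"
identical(by_bracket, df$bb)   # TRUE
identical(by_double, df$bb)    # TRUE
```

So when the column name lives in a variable, use [ , ] or [[ ]]; use $ only when you type the name itself.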

Finally,

df[[col]]

is the same as df[,col] if col is a single number. If col is a character value ("character" in R means the same as a string in other languages), then it selects the column with that name. Example:

df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]

Now x, y, and z each contain a copy of the second column of df.

The notation foo[[bar]] comes from lists. The notation foo[,bar] comes from matrices. Since a data frame has features of both a matrix and a list, it can use both.
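The list/matrix distinction is also the likely reason the comma matters in the question. The sketch below assumes a small made-up data frame with three rows and three columns: with a comma the logical vector picks rows (matrix style), without it the same vector picks columns (list style).

```r
# Hypothetical data: 3 rows, 3 columns
df <- data.frame(aa = c(1, 2, 1), bb = c(10, 11, 12), cc = c(7, 8, 9))

keep <- df$aa == 1     # TRUE FALSE TRUE

rows <- df[keep, ]     # matrix style: rows 1 and 3, all columns
cols <- df[keep]       # list style: columns 1 and 3, all rows

dim(rows)              # 2 3
names(cols)            # "aa" "cc"

# Single brackets without a comma keep the data frame;
# double brackets extract the column itself
is.data.frame(df["aa"])   # TRUE
is.numeric(df[["aa"]])    # TRUE
```

In other words, df[df$col == 1, ] filters rows, while df[df$col == 1] treats the logical vector as a column selector.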

Piotr · Answer 2 · 2017-06-26T17:56:29.690

1

Use $ when you want to select one specific column by name: df$col_name.

Use [] when you want to select one or more columns by number:

df[,1] # select column with index 1
df[,1:3] # select columns with indexes 1 to 3
df[,c(1,3:5,7)] # select columns with indexes 1, 3 to 5 and 7.

[[]] is mostly for lists.

EDIT: df[which(df$col == 1), ] works because df$col == 1 creates a logical vector that is TRUE where the column value equals 1 and FALSE otherwise, and the which function converts it to the integer positions of the TRUE entries, dropping any NA. These positions are passed to df[ , ] as the row index, so only the matching rows are returned.

See "Remove rows with NAs (missing values) in data.frame" to find out more about how to deal with missing values. It is often good practice to exclude missing values from the dataset before subsetting.
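A small sketch of the NA behaviour, using made-up data with one missing value in col. The names cond, idx, with_na, with_which and explicit are only for illustration.

```r
# Hypothetical data with a missing value in col
df <- data.frame(col = c(1, NA, 2, 1), val = c(10, 20, 30, 40))

cond <- df$col == 1    # TRUE NA FALSE TRUE
idx  <- which(cond)    # 1 4: positions of TRUE, the NA is dropped

with_na    <- df[cond, ]                   # 3 rows, one of them all NA
with_which <- df[idx, ]                    # 2 rows
explicit   <- df[!is.na(cond) & cond, ]    # 2 rows, NA handled by hand

nrow(with_na)       # 3
with_which$val      # 10 40
explicit$val        # 10 40
```

Indexing with the raw logical vector keeps a row of NAs for every NA in the condition; which, or an explicit !is.na() check, avoids that.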

edited Jun 26 '17 at 17:56

answered Jun 26 '17 at 17:36

Piotr


thanks for the clarification. @joran pointed out something similar in comments. Have edited the question -- sorry for the confusion – goofd Jun 26 '17 at 17:42
@goofd The use of `which` here would be to treat any `NA`s arising from the boolean comparison as FALSE. Otherwise, you'd be indexing by a boolean vector with (potentially) NA values in it, which would generate "NA rows". You will probably find there is some debate in the R community about this behavior. – joran Jun 26 '17 at 17:45
that's really interesting... these quirks with `NA`s in each language always get me – goofd Jun 26 '17 at 17:52
