
I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.

I am coming from the Python world, where for similar operations I can select using either df[df[col] == 1] or df[df.col == 1] with the same end result.

However, in R, df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed with the summary command). I do not understand this difference, since links like http://adv-r.had.co.nz/Subsetting.html suggest that either way is fine. Also, running str on df$col and df[, col] shows the same output.

Are there any guidelines about when to use the $ operator vs the [] operator?

Edit: after digging a little deeper and using this question as a reference, it seems that the following code works correctly:

df[which(df$col == 1), ]

However, it is not clear to me how to guard against NA values and when to use which.

goofd
  • Based on your examples I sort of wonder if you might be a little confused about the distinction between `[` and `[[` for lists (which includes data frames)? Because using single and double braces has different results. (See the top related question linked over at the right.) – joran Jun 26 '17 at 17:34
  • thanks have edited the question – goofd Jun 26 '17 at 17:40
  • If you are interested in using the data.table package, you can subset rows based on a logical condition very easily. See https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html – be_green Jun 26 '17 at 17:41

2 Answers


You are mixing up several different things.

In

df[,col]

col should be a variable holding the column number (a column name given as a string also works). For example,

col = 2
x = df[,col]

would select the second column and store it in x.

In

df$col

col is taken literally as the column name; it is not evaluated as a variable. For example,

df=data.frame(aa=1:5,bb=10:14)
x = df$bb

would select the second column and store it in x. But you cannot write df$2.
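To illustrate why df$col and df[, col] can differ, here is a minimal sketch. It assumes a variable named col that holds a column name, which is an illustrative setup and not taken from the question's actual data.

```r
df <- data.frame(aa = 1:5, bb = 10:14)
col <- "bb"        # hypothetical variable holding a column name

x <- df$bb         # the column named "bb"
y <- df$col        # looks for a column literally named "col"; none exists, so NULL
z <- df[, col]     # col is evaluated first, so this fetches column "bb"
```

In this sketch x and z hold the same vector, while y is NULL because $ never looks at the value stored in col.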

Finally,

df[[col]]

is the same as df[,col] if col is a number. If col is a character string (R calls strings "character"), it selects the column with that name. Example:

df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]

Now x, y, and z each contain a copy of the second column of df.

The notation foo[[bar]] comes from lists. The notation foo[,bar] comes from matrices. Since a data frame has features of both a matrix and a list, it supports both.
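The list/matrix duality also suggests one plausible reason why df[df$col == 1] misbehaves in the question: with a single index and no comma, a data frame is indexed like a list, so the index selects columns rather than rows. A minimal sketch, using a made-up data frame:

```r
df <- data.frame(aa = 1:5, bb = 10:14)

# List-style access (no comma)
a <- df["bb"]            # single bracket: a one-column data frame
b <- df[["bb"]]          # double bracket: the column itself, a vector
e <- df[c(TRUE, FALSE)]  # a logical index here picks COLUMNS: keeps aa only

# Matrix-style access (with a comma)
v <- df[, "bb"]          # one column, returned as a vector by default
d <- df[df$aa > 3, ]     # a logical index before the comma picks ROWS
```

So for row filtering the comma matters: df[cond, ] filters rows, whereas df[cond] treats cond as a column selector and may return the wrong columns or fail.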

user31264

Use $ when you want to select one specific column by name: df$col_name.

Use [] when you want to select one or more columns by number:

  • df[, 1] # select the column with index 1
  • df[, 1:3] # select the columns with indexes 1 to 3
  • df[, c(1, 3:5, 7)] # select the columns with indexes 1, 3 to 5, and 7
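The column selections above can be sketched on a small made-up data frame. The column names a to d are illustrative assumptions. The sketch also shows that selecting a single column returns a plain vector unless drop = FALSE is given.

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

one     <- df[, 1]                # a single column comes back as a vector
some    <- df[, c(1, 3:4)]        # several columns come back as a data frame
one_df  <- df[, 1, drop = FALSE]  # keep a single column as a data frame
by_name <- df[, c("a", "c")]      # names work in the same position as numbers
```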

[[]] is mostly for lists.

EDIT: df[which(df$col == 1), ] works because df$col == 1 produces a logical vector with one entry per row (TRUE, FALSE, or NA), and which returns the integer positions of the entries that are TRUE, dropping both FALSE and NA. These row positions are passed to df[ , ], so only the matching rows are returned.
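A small sketch of how NA values change the result, using a made-up data frame with one missing value in col. The variable names are illustrative only.

```r
df <- data.frame(col = c(1, NA, 2, 1), val = c(10, 20, 30, 40))

keep <- df$col == 1            # TRUE NA FALSE TRUE

with_na    <- df[keep, ]       # the NA in keep yields an extra all-NA row
with_which <- df[which(keep), ]                    # which() drops NA and FALSE
explicit   <- df[!is.na(df$col) & df$col == 1, ]   # same rows without which()
via_subset <- subset(df, col == 1)                 # subset() treats NA as FALSE
```

Here with_na has three rows, one of them filled with NA, while the other three approaches return only the two rows where col equals 1.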

See "Remove rows with NAs (missing values) in data.frame" to find out more about how to deal with missing values. It is generally good practice to handle missing values explicitly rather than leave them in the dataset unnoticed.

Piotr
  • thanks for the clarification. @joran pointed out something similar in comments. Have edited the question -- sorry for the confusion – goofd Jun 26 '17 at 17:42
  • @goofd The use of `which` here would be to treat any `NA`s arising from the boolean comparison as FALSE. Otherwise, you'd be indexing by a boolean vector with (potentially) NA values in it, which would generate "NA rows". You will probably find there is some debate in the R community about this behavior. – joran Jun 26 '17 at 17:45
  • that's really interesting... these quirks with `NA` 's in each language always gets me – goofd Jun 26 '17 at 17:52