Subsetting rows and columns for both data.frame and tbl_df

Question

Data.frame allows operations on column subsets using [ , dropping single column/row outputs to vectors by default. Dplyr does not allow this, deliberately (and seemingly because coding was an absolute nightmare).

df <- data.frame(a = c(1:5,NA), b = c(1,1,1,2,2,2))
mean(df[,"a"], na.rm = T) # 3

dftbl <- as.tbl(df)
mean(dftbl[,"a"], na.rm = T) # NA

Advice is therefore to subset with [[ as this will deliver uniform outputs for both dfs and tbl_dfs. But: that's fine for columns or rows only, but not for rows+columns, and concerningly this difference can be missed if you don't check the warnings (which is my own fault admittedly), e.g.:

dfresult <- mean(df[df$b == 2, "a"], na.rm = T) # 4.5
tblresult <- mean(dftbl[dftbl$b == 2, "a"], na.rm = T) # NA_real_

Does anyone have any 'best practice' suggestions for performing column operations on row subsets? Is this where I should improve my dplyr game using filter & select? My attempts thus far keep hitting walls. Grateful for any golden rules. Thanks in advance.

dftbl %>% filter(b == 2) %>% select(a) %>% mean(na.rm = T) #NA

This fails in the same way, with the filtered & selected data STILL being an N*1 tibble which refuses to play with mean.

dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% .$a
# [1]  4  5 NA

But

dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% mean(.$a, na.rm = T)
# [1] NA

akrun · Accepted Answer · 2019-09-20T19:50:52.863

2

The reason is because we need [[ instead of [ as with [ it is still a tibble with one column. The mean expects the input as a vector

mean(dftbl[["a"]], na.rm = TRUE) 
#[1] 3

Or use $

mean(dftb$a, na.rm = TRUE)

Regarding the second case, select also returns a tibble with the selected columns. Instead, we can use pull to extract as a vector

dftbl[dftbl$b == 2, "a"] %>% 
    pull(1)
#[1]  4  5 NA

Or if we don't want to load any libraries, use unlist

mean(unlist(dftbl[dftbl$b == 2, "a"]), na.rm = TRUE)
#[1] 4.5

For the code mentioned in the OP's post

dftbl %>% 
    filter(b == 2) %>% 
    select(a)  %>%
     .$a %>%
     mean(., na.rm = TRUE)
#[1] 4.5

Or with pull

dftbl %>%
    filter(b == 2) %>% 
    pull(a) %>%
    mean(na.rm = TRUE)
#[1] 4.5

edited Sep 20 '19 at 19:50

answered Sep 20 '19 at 19:29

akrun

874,273
37
540
662

thanks akrun, great overview of approaches. In the penultimate code block (which is the approach I'll try to use in future, or unlist), it seems unintuitive that one would need both select(a) and also .$a but I suppose this is me still getting my head around tibbles never dropping... but then why does select(a) %>% mean(.$a, na.rm = T) not work? Presumably the (., in the mean line [of your code] is pulling the unchanged result from the line above i.e. .$a (4 5 NA) so shouldn't mean() work the same? Cheers! – dez93_2000 Sep 20 '19 at 20:02
1

@dez93_2000. It is due to evaluation block. You can make it as a single block with `{}` i.e. `select(a) %>% {mean(.$a, na.rm = T)}` operators have precedence – akrun Sep 20 '19 at 20:04
interesting, thanks again. Only just seen {} used in some gganimate script, guess I'll get reading to find out more. Cheers – dez93_2000 Sep 23 '19 at 00:04

Subsetting rows and columns for both data.frame and tbl_df

1 Answers1