Data.frame allows operations on column subsets using [
, dropping single column/row outputs to vectors by default. Dplyr does not allow this, deliberately (and seemingly because coding was an absolute nightmare).
df <- data.frame(a = c(1:5,NA), b = c(1,1,1,2,2,2))
mean(df[,"a"], na.rm = T) # 3
dftbl <- as.tbl(df)
mean(dftbl[,"a"], na.rm = T) # NA
Advice is therefore to subset with [[
as this will deliver uniform outputs for both dfs and tbl_dfs.
But: that's fine for columns or rows only, but not for rows+columns, and concerningly this difference can be missed if you don't check the warnings (which is my own fault admittedly), e.g.:
dfresult <- mean(df[df$b == 2, "a"], na.rm = T) # 4.5
tblresult <- mean(dftbl[dftbl$b == 2, "a"], na.rm = T) # NA_real_
Does anyone have any 'best practice' suggestions for performing column operations on row subsets? Is this where I should improve my dplyr
game using filter
& select
? My attempts thus far keep hitting walls. Grateful for any golden rules. Thanks in advance.
dftbl %>% filter(b == 2) %>% select(a) %>% mean(na.rm = T) #NA
This fails in the same way, with the filtered & selected data STILL being an N*1 tibble which refuses to play with mean
.
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% .$a
# [1] 4 5 NA
But
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% mean(.$a, na.rm = T)
# [1] NA