1

I am trying to perform a conditional summation using R and dplyr using implicit column names. So starting with

> df <- data.frame(colA=c(1,2,1,1),
+                  colB=c(0,0,3,1),
+                  colC=c(0,1,2,3),
+                  colD=c(2,2,2,2))
> df
  colA colB colC colD
1    1    0    0    2
2    2    0    1    2
3    1    3    2    2
4    1    1    3    2

I am trying to apply the psuedocode:

foreach column c
    if(row.val > 1)
        calc += (row.val - 1)

I can accomplish this in a fairly straightforward manner using some simple base R subsetting:

> df.ans <- data.frame(calcA = sum(df$colA[df$colA > 1] - 1),
+                      calcB = sum(df$colB[df$colB > 1] - 1),
+                      calcC = sum(df$colC[df$colC > 1] - 1),
+                      calcD = sum(df$colD[df$colD > 1] - 1))
> df.ans
  calcA calcB calcC calcD
1     1     2     3     4

However I would like a solution that does not have to explicitly state the column names (colA, colB, etc.) because there are many and they may change in the future. If I were doing a simple sum the calculation would be possible with dplyr and:

df %>% 
summarise_all(funs(sum))

Things I have tried:

  • The filter_at components of dplyr but found that insufficient for this purpose because they take entire rows, whereas I am filtering rows per column independently.
  • This answer but found it insufficient because it uses explicit column names.
  • Conditionals inside a custom summarise function. This is probably the closest I have gotten but the evaluations always resolve to booleans which throw off the summation. For example summarise_all(funs(sum(. > 1)))
Blundering Ecologist
  • 1,199
  • 2
  • 14
  • 38
JHowIX
  • 1,683
  • 1
  • 20
  • 38

2 Answers2

2

You can translate the hard coding example to summarize_all pretty easily as, i.e, replace df$col.. with .:

df %>% summarise_all(~ sum(.[. > 1] - 1))

#  colA colB colC colD
#1    1    2    3    4

Or with the funs syntax:

df %>% summarise_all(funs(sum(.[. > 1] - 1)))

#  colA colB colC colD
#1    1    2    3    4
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • could you please elaborate on the purpose/need for the `~`. I tried this exact solution (minus the `~`) and spend a nontrivial amount of time researching the error "Can't create call to non-callable object" to no avail. The `summarize_all` documentation also has no examples where the `~` is used. – JHowIX Dec 29 '17 at 17:49
  • `~` is the purrr/tidyverse syntax for creating a function, which serves the same purpose as `funs`, i.e. `df %>% summarise_all(funs(sum(.[. > 1] - 1)))` . Both syntax should work in `0.7.4` (my current dplyr version), if you get an error, you can either switch to the `funs` syntax or upgrade `dplyr` and see if it helps. – Psidom Dec 29 '17 at 17:54
1

You can also use sapply from base R:

sapply(df,function(x)sum(x[x>1]-1))
colA colB colC colD 
   1    2    3    4 
Onyambu
  • 67,392
  • 3
  • 24
  • 53