
What is the dplyr way to apply a function rowwise to some columns? For example, I want to grab all the V columns and turn them into percents based on the row sums. I show how to do it in base R below. What about in a dplyr chain? It'd be nice to see a data.table form as well (though preference would go to a dplyr solution here).

x <- data.frame(A=LETTERS[1:5], as.data.frame(matrix(sample(0:5, 25, T), ncol=5)))

data.frame(x[1], x[-1]/rowSums(x[-1]))


##   A        V1        V2        V3         V4         V5
## 1 A 0.1428571 0.2142857 0.2142857 0.35714286 0.07142857
## 2 B 0.2000000 0.2000000 0.1500000 0.20000000 0.25000000
## 3 C 0.3571429 0.2857143 0.0000000 0.07142857 0.28571429
## 4 D 0.1904762 0.2380952 0.1904762 0.23809524 0.14285714
## 5 E 0.2000000 0.2500000 0.1500000 0.25000000 0.15000000

library(dplyr)

props <- function(x) round(x/sum(x), 2)

# does not work
x %>%
    rowwise()
    mutate(props(matches("^.{2}$")))
Tyler Rinker
  • Maybe `x %>% rowwise() %>% select(matches("^.{2}$")) %>% props %>% cbind(x[1], .)`? The second half isn't really dplyrey though – Rich Scriven Apr 09 '16 at 21:27
  • I'm not a dplyr expert, but can't you just use `rowSums` in dplyr too? Something like `props <- function(x, y) round(x/y, 2) ; x %>% mutate(Total = rowSums(.[-1])) %>% mutate_each(funs(./Total), -c(A, Total))`. Though both `rowSums` and `rowwise` should be inefficient. I would go with ```Reduce(`+`, .[-1]))``` instead, if you don't have `NA`s. – David Arenburg Apr 09 '16 at 21:30
  • @DavidArenburg Nice can you throw down as an answer. It works. – Tyler Rinker Apr 09 '16 at 21:39
  • Tyler, did you see [this](http://stackoverflow.com/questions/27354734/dplyr-mutate-rowsums-calculations-or-custom-functions)? I wonder if you want some general row wise solution or specifically sum by row? – David Arenburg Apr 09 '16 at 21:45
  • @Frank true no need. I'll remove, though it should allow anyone to just run the code with no need to type library etc. to get dplyr. – Tyler Rinker Apr 09 '16 at 21:49
  • @DavidArenburg I hadn't seen that but what you demo-ed is exactly what I was after. Funny it comes from a known data.table guy. – Tyler Rinker Apr 09 '16 at 21:52
  • A "known data.table guy" who is near hadley in the SO dplyr answers leaderboard :) http://stackoverflow.com/tags/dplyr/topusers – Frank Apr 09 '16 at 22:11
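David Arenburg's comment above contains the approach the OP accepted; a runnable sketch of it follows (the `set.seed()` call is an addition for reproducibility, and `mutate_at()`/`vars()` are used as later, still-working forms of the `mutate_each()`/`funs()` idioms from the comment):

```r
# Sketch of the approach from David Arenburg's comment: add a row total once,
# scale every V column by it, then drop the total column.
library(dplyr)

set.seed(1)  # assumption: seed added for reproducibility (not in the original)
x <- data.frame(A = LETTERS[1:5],
                as.data.frame(matrix(sample(0:5, 25, TRUE), ncol = 5)))

res <- x %>%
  mutate(Total = rowSums(.[-1])) %>%
  mutate_at(vars(-A, -Total), ~ . / Total) %>%
  select(-Total)
res
```

This avoids `rowwise()` entirely: the row sums are computed once as an ordinary vectorized column, which is the point both commenters were making about efficiency.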

3 Answers


In data.table, you can do

library(data.table)
setDT(x)

x[, grep("^V", names(x)) := .SD/Reduce(`+`, .SD), .SDcols = V1:V5]

   A         V1        V2        V3         V4         V5
1: A 0.28571429 0.0000000 0.2857143 0.07142857 0.35714286
2: B 0.23076923 0.2307692 0.3076923 0.15384615 0.07692308
3: C 0.44444444 0.0000000 0.4444444 0.00000000 0.11111111
4: D 0.07142857 0.3571429 0.1428571 0.07142857 0.35714286
5: E 0.00000000 0.2222222 0.3333333 0.44444444 0.00000000

To compute the denominator with NA values ignored, I guess rowSums is an option, though it will coerce .SD to a matrix as an intermediate step.
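That `rowSums` variant can be sketched as follows (the `set.seed()` call and the NAs mixed into the sample are additions for illustration; column names are assumed to match the question's `V1:V5`):

```r
library(data.table)

set.seed(1)  # assumption: seed added for reproducibility (not in the original)
x <- data.frame(A = LETTERS[1:5],
                as.data.frame(matrix(sample(c(0:5, NA), 25, TRUE), ncol = 5)))
setDT(x)

cols <- grep("^V", names(x), value = TRUE)
# rowSums(..., na.rm = TRUE) ignores NAs in the denominator,
# at the cost of coercing .SD to a matrix as an intermediate step
x[, (cols) := .SD / rowSums(.SD, na.rm = TRUE), .SDcols = cols]
x
```

NA cells stay NA after the division; only the denominator skips them.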

Frank
  • It's fine. I didn't invent ```Reduce(`+`,...```. I'm just wondering if this isn't a dupe? – David Arenburg Apr 09 '16 at 21:46
  • http://stackoverflow.com/questions/35306500/r-data-table-count-na-per-row/35306944 is relevant, though not exactly the same. – thelatemail Apr 09 '16 at 21:52
  • @DavidArenburg Seems it should be a dupe but I couldn't find one with an obvious title. – Tyler Rinker Apr 09 '16 at 21:53
  • @thelatemail We are doing these ```Reduce(`+`, .SD)``` for [some time now](http://stackoverflow.com/questions/30353894/replace-inf-in-r-data-table-show-number-of-inf-in-colums/30354129#30354129). Here's akrun [in 2014](http://stackoverflow.com/questions/26305233/how-can-i-specify-columns-in-r-to-be-used-in-matches-without-listing-each-indiv/26305482#26305482) – David Arenburg Apr 09 '16 at 21:57
  • Didn't realize you could pass column numbers on LHS of `:=`... I would have used `value = TRUE` – MichaelChirico Apr 10 '16 at 16:19

You can combine tidyr's gather and spread with dplyr's group_by and mutate to get the following single pipeline:

library(dplyr)
library(tidyr)

x <- data.frame(A=LETTERS[1:5], as.data.frame(matrix(sample(0:5, 25, T), ncol=5)))

y <- x %>% 
        gather(V, val, -A) %>% 
        group_by(A) %>% 
        mutate(perc = val / sum(val)) %>% 
        select(-val) %>%
        spread(V, perc)

With tidy data it's quite easy to get any group-wise sum (rows, columns or any nested index-level) and compute percentages. The spread and gather will get you to and from your input data format.

TemplateRex
  • Ahh makes perfect sense. This is a "why didn't I think of that moment". – Tyler Rinker Apr 09 '16 at 22:26
  • doing a by-row grouping like this, my guess is this will slow down quickly as data grows – eddi Apr 11 '16 at 20:15
  • @eddi I haven't tested this on big data. dplyr probably isn't the best way for that any way. I doubt that it would be slower than the `data.frame` rescaling by `rowSums` that the OP showed. In the tidy data format, one could always do an `arrange(A)` before doing the `group_by(A)`, so that the data per group is sequentially processed. – TemplateRex Apr 12 '16 at 09:05

Another "tidyverse" solution is to use a select within a mutate. E.g.

library(tidyverse)

x <- data.frame(A=LETTERS[1:5], as.data.frame(matrix(sample(0:5, 25, T), ncol=5)))

x %>% 
  mutate(row_counts = select_if(., is.numeric) %>% rowSums()) %>% 
  mutate_at(vars(contains("V")), funs(./row_counts)) %>% 
  select(-row_counts)
#>   A        V1         V2        V3        V4        V5
#> 1 A 0.0000000 0.14285714 0.1428571 0.5714286 0.1428571
#> 2 B 0.0000000 0.62500000 0.1250000 0.1250000 0.1250000
#> 3 C 0.2222222 0.11111111 0.2222222 0.1111111 0.3333333
#> 4 D 0.3000000 0.50000000 0.1000000 0.1000000 0.0000000
#> 5 E 0.3333333 0.06666667 0.1333333 0.3333333 0.1333333

Created on 2019-02-16 by the reprex package (v0.2.1)
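For comparison, in dplyr 1.0+ the superseded `select_if()`/`mutate_at()`/`funs()` idioms from this answer can be written with `across()`; this is a sketch of the same computation, not code from the original answer (the `set.seed()` call is an addition for reproducibility):

```r
library(dplyr)

set.seed(1)  # assumption: seed added for reproducibility (not in the original)
x <- data.frame(A = LETTERS[1:5],
                as.data.frame(matrix(sample(0:5, 25, TRUE), ncol = 5)))

res <- x %>%
  # rowSums() over the numeric columns selected by across()
  mutate(row_total = rowSums(across(where(is.numeric)))) %>%
  # divide each V column by the row total, then drop the helper column
  mutate(across(starts_with("V"), ~ .x / row_total)) %>%
  select(-row_total)
res
```

As in the answer above, the row total is built as an ordinary column and dropped at the end, so no `rowwise()` grouping is needed.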

Bryan Shalloway