Use of column inside sum() function using dplyr's mutate() function

Question

I have a data frame and want to create a new column prob using dplyr's mutate() function. For each row, prob should hold the proportion of rows in the data frame whose value is greater than that row's value, i.e. P(column value > row value). Here is what I tried:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

require(dplyr)

data %>% mutate(prob = sum(value < data$value) / nrow(data))

This gives the following results:

   value prob
1      1    0
2      2    0
3      3    0
4      3    0
...    ...  ...

Here prob only contains 0 for each row. If I replace value with 2 in the expression sum(value < data$value):

data %>% mutate(prob = sum(2 < data$value) / nrow(data))

I get the following results:

   value      prob
1      1 0.8823529
2      2 0.8823529
3      3 0.8823529
4      3 0.8823529
...    ...  ...

0.8823529 is the proportion of rows in the data frame whose value is greater than 2, which is what I expect. So the problem seems to be that mutate() doesn't accept the value column as an argument inside the sum() function.

`mutate`? `dplyr`? Do you want `sapply(data$value, function(x) sum(x < data$value) / nrow(data))`? — agstudy, Oct 05 '14 at 08:52
@Simen, you could adapt agstudy's code a bit into dplyr: data %>% mutate(prob = sapply(value, function(x) sum(x < value) / nrow(data))) — KFB, Oct 05 '14 at 11:56
If it works, it could be the answer. Could you check/tick the answer so as to close the case? : ) — KFB, Oct 05 '14 at 14:55

score 4 · Accepted Answer · answered Oct 05 '14 at 14:54

You can adapt agstudy's code to dplyr by wrapping the comparison in sapply() inside mutate():

data %>% mutate(prob = sapply(value, function(x) sum(x < value) / nrow(data)))
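Why the original attempt returns 0: inside mutate(), value is the whole column, so value < data$value compares the column with itself element by element. Every comparison is FALSE, the sum is 0, and that single number is recycled to every row. A minimal base R sketch of the difference (elementwise and per_row are illustrative names, not part of the original code):

```r
data <- data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

# Column compared with itself, element by element: all FALSE, so the sum is 0.
elementwise <- sum(data$value < data$value) / nrow(data)
elementwise
# [1] 0

# sapply() hands over one value at a time, so each x is compared
# against the entire column.
per_row <- sapply(data$value, function(x) sum(x < data$value) / nrow(data))
head(per_row, 4)
# [1] 0.9411765 0.8823529 0.7647059 0.7647059
```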

KFB

Could simplify a little by using `mean()` – hadley Oct 09 '14 at 11:49
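
A sketch of what that simplification might look like, assuming the comment refers to replacing sum(...) / nrow(data) with mean(): the mean of a logical vector is the fraction of TRUE values, so the division is no longer needed. The name result is illustrative.

```r
library(dplyr)

data <- data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

# mean() of a logical vector gives the proportion of TRUE entries,
# which equals sum(x < value) / nrow(data) here.
result <- data %>%
  mutate(prob = sapply(value, function(x) mean(x < value)))

head(result, 4)
```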

score 0 · Answer 2 · answered Oct 05 '14 at 15:49

I think a basic vapply (or sapply) would make much more sense here. However, if you really wanted to take the scenic route, you can try something like this:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

data %>% 
  rowwise() %>%                ## You are really working by rows here
  do(prob = sum(.$value < data$value) / nrow(data)) %>%
  mutate(prob = c(prob)) %>%   ## The previous value was a list -- unlist here
  cbind(data)                  ## and combine with the original data
#          prob value
# 1  0.94117647     1
# 2  0.88235294     2
# 3  0.76470588     3
# 4  0.76470588     3
# 5  0.58823529     4
# 6  0.58823529     4
# 7  0.58823529     4
# 8  0.47058824     5
# 9  0.47058824     5
# 10 0.41176471     6
# 11 0.35294118     7
# 12 0.05882353     8
# 13 0.05882353     8
# 14 0.05882353     8
# 15 0.05882353     8
# 16 0.05882353     8
# 17 0.00000000     9
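
For comparison, a minimal sketch of the plain vapply approach mentioned at the top of this answer, in base R only. It is one possible way to write it, not necessarily what the answerer had in mind. vapply() behaves like sapply() here but additionally checks that each result is a single numeric value.

```r
data <- data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

# numeric(1) declares that the function must return exactly one number per
# element; vapply() raises an error otherwise.
data$prob <- vapply(
  data$value,
  function(x) sum(x < data$value) / nrow(data),
  numeric(1)
)

head(data, 4)
```

The prob values should match the column printed above, with the columns in value, prob order.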
