I have a data frame and want to add a new column prob using dplyr's mutate() function. For each row, prob should contain the proportion of rows in the data frame whose value is greater than that row's value, i.e. the empirical probability P(column value > row value). Here is what I tried:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

require(dplyr)

data %>% mutate(prob = sum(value < data$value) / nrow(data))

This gives the following results:

   value prob
1      1    0
2      2    0
3      3    0
4      3    0
...    ...  ...

Here prob only contains 0 for each row. If I replace value with 2 in the expression sum(value < data$value):

data %>% mutate(prob = sum(2 < data$value) / nrow(data))

I get the following results:

   value      prob
1      1 0.8823529
2      2 0.8823529
3      3 0.8823529
4      3 0.8823529
...    ...  ...

0.8823529 is the proportion of rows in the data frame whose value is greater than 2 (15 of 17). The problem seems to be that mutate() doesn't accept the value column as an argument inside sum().

hrbrmstr
Simen
  • `mutate`? `dplyr`? Do you want `sapply(data$value, function(x) sum(x < data$value) / nrow(data))`? – agstudy Oct 05 '14 at 08:52
  • Thanks! Keep it simple – great idea... – Simen Oct 05 '14 at 09:01
  • @Simen, you could adapt agstudy's code a bit into dplyr: `data %>% mutate(prob = sapply(value, function(x) sum(x < value) / nrow(data)))` – KFB Oct 05 '14 at 11:56
  • If it works, it could be the answer. Could you check/tick the answer so as to close the case? : ) – KFB Oct 05 '14 at 14:55

2 Answers

You can adapt agstudy's sapply() code so that it runs inside mutate():

data %>% mutate(prob = sapply(value, function(x) sum(x < value) / nrow(data)))
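
A minimal base-R sketch of why the original attempt returns 0 in every row, assuming the same 17 values as in the question. Inside mutate(), value refers to the whole column, so value < data$value compares the column with itself element by element and is never TRUE. sapply() instead passes one element at a time as x, so each x is compared against the full column.

```r
value <- c(1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9)

# What the original expression effectively computes: the column is compared
# with itself position by position, so no element is smaller than itself.
elementwise <- value < value
sum(elementwise)   # 0, which mutate() then recycles into every row

# What is wanted: compare each single value x against the whole column.
prob <- sapply(value, function(x) sum(x < value) / length(value))

# First row: 16 of the 17 values are greater than 1.
prob[1] == 16 / 17
```

The same function body is what the mutate() call above uses; only the way value is looked up differs.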
KFB

I think a basic vapply (or sapply) would make much more sense here. However, if you really wanted to take the scenic route, you can try something like this:

data = data.frame(value = c(1,2,3,3,4,4,4,5,5,6,7,8,8,8,8,8,9))

data %>% 
  rowwise() %>%                ## You are really working by rows here
  do(prob = sum(.$value < data$value) / nrow(data)) %>%
  mutate(prob = c(prob)) %>%   ## The previous value was a list -- unlist here
  cbind(data)                  ## and combine with the original data
#          prob value
# 1  0.94117647     1
# 2  0.88235294     2
# 3  0.76470588     3
# 4  0.76470588     3
# 5  0.58823529     4
# 6  0.58823529     4
# 7  0.58823529     4
# 8  0.47058824     5
# 9  0.47058824     5
# 10 0.41176471     6
# 11 0.35294118     7
# 12 0.05882353     8
# 13 0.05882353     8
# 14 0.05882353     8
# 15 0.05882353     8
# 16 0.05882353     8
# 17 0.00000000     9
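
For comparison, the vapply route mentioned at the top of this answer might look like the sketch below, using the same data. The rank-based variant is an extra, swapped-in technique that is not part of the original answer; the column name prob_rank is made up for illustration.

```r
data <- data.frame(value = c(1, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9))
n <- nrow(data)

# vapply() version: for each value x, the share of rows with a larger value.
# numeric(1) declares that every call must return exactly one number.
data$prob <- vapply(data$value, function(x) mean(data$value > x), numeric(1))

# Vectorised alternative (an assumption of this sketch, not from the answer):
# rank(..., ties.method = "max") gives the number of rows with value <= x,
# so n minus that rank is the number of rows with value > x.
data$prob_rank <- (n - rank(data$value, ties.method = "max")) / n

# Both columns should agree up to floating-point tolerance.
all.equal(data$prob, data$prob_rank)
```

Either expression can also be placed inside mutate(). If dplyr's ranking helpers are available, 1 - cume_dist(value) should express the same idea, since cume_dist() returns the proportion of values less than or equal to each value.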
A5C1D2H2I1M1N2O1R2T1