
This question is an extension to the StackOverflow question asked and answered here.

My situation differs in that I want to calculate the percentile of each value within a vector of 50,000 (or more) values. For example:

df <- data.frame(val = rnorm(n = 50000, mean = 50, sd = 20))
df$val.percentile <- sapply(X = df$val, function(x) ecdf(df$val)(x))
head(df)

Is there a good way to optimize the process for calculating the percentile for each value? Essentially I'd like to make it as efficient as possible so the run time is as small as possible.

Community
AQS

2 Answers


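ecdf() returns a function that is already vectorized, so there is no reason to use an apply function. Calling ecdf(df$val) inside sapply() also rebuilds the whole distribution function once per element, which is where the time goes. You can simply run: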

df$val.percentile <- ecdf(df$val)(df$val)
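
To make the difference concrete, here is a minimal sketch that checks both approaches return the same values and times them. The vector size (n = 2000) and the names slow, fast, and Fn are illustrative assumptions, not part of the question; n is kept small because the sapply() version gets slow quickly.

```r
# Compare the per-element approach with a single vectorized call.
# n is deliberately small: the sapply() version rebuilds the ECDF n times.
set.seed(42)
n <- 2000
val <- rnorm(n, mean = 50, sd = 20)

# Original approach: ecdf(val) is constructed again for every element
slow <- sapply(val, function(x) ecdf(val)(x))

# Vectorized approach: build the step function once, evaluate it at all values
Fn   <- ecdf(val)
fast <- Fn(val)

# Both approaches should agree
all.equal(slow, fast)

# Rough timings; exact numbers depend on the machine
system.time(sapply(val, function(x) ecdf(val)(x)))
system.time(ecdf(val)(val))
```

Keeping the result of ecdf(val) in a variable such as Fn is also handy if you later need percentiles for new values against the same reference distribution.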
Barker

You can use dplyr::percent_rank() to rank each value as a percentile; it rescales the minimum rank of each value to the interval [0, 1]. This is different, however, from dplyr::cume_dist(), which evaluates the empirical cumulative distribution function (the proportion of all values less than or equal to the current value).

Reproducible example:

set.seed(1)
df <- data.frame(val = rnorm(n = 1000000, mean = 50, sd = 20))

Show that percent_rank() differs from cume_dist() and that cume_dist() is the same as ecdf(x)(x):

library(tidyverse)

head(df) %>% 
  mutate(pr  = percent_rank(val), 
         cd  = ecdf(val)(val), 
         cd2 = cume_dist(val))

       val  pr        cd       cd2
1 37.47092 0.4 0.5000000 0.5000000
2 53.67287 0.6 0.6666667 0.6666667
3 33.28743 0.0 0.1666667 0.1666667
4 81.90562 1.0 1.0000000 1.0000000
5 56.59016 0.8 0.8333333 0.8333333
6 33.59063 0.2 0.3333333 0.3333333
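
The difference between the two columns comes down to how the rank is rescaled. As a rough sketch in base R, assuming no missing values and using the six values printed above (pr_base and cd_base are names made up for this illustration), the two definitions can be written with rank():

```r
# The six values from the table above
val <- c(37.47092, 53.67287, 33.28743, 81.90562, 56.59016, 33.59063)
n <- length(val)

# percent_rank: minimum rank rescaled to [0, 1]
# (count of values strictly less than x, divided by n - 1)
pr_base <- (rank(val, ties.method = "min") - 1) / (n - 1)

# cume_dist: proportion of values less than or equal to x
cd_base <- rank(val, ties.method = "max") / n

data.frame(val, pr_base, cd_base)

# cd_base should match the empirical CDF evaluated at each value
all.equal(cd_base, ecdf(val)(val))
```

So percent_rank() always starts at 0 for the smallest value, while cume_dist() and ecdf() start at 1/n; for large n the two are nearly identical.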

Speed of each approach for this example dataset is roughly similar, not exceeding a factor of 2:

library(microbenchmark)
mbm <- microbenchmark(
    pr_dplyr = mutate(df, pr = percent_rank(val)),
    cd_dplyr = mutate(df, cd = cume_dist(val)),
    cd_base  = mutate(df, cd = ecdf(val)(val)),
    times = 20
)

autoplot(mbm)


Rich Pauloo
timtrice