1

I am working with the R programming language.

I have the following dataset:

set.seed(123)
library(dplyr)

var1 = rnorm(10000, 100,100)
var2 = rnorm(10000, 100,100)
var3 = rnorm(10000, 100,100)
var4 = rnorm(10000, 100,100)
var5 <- factor(sample(c("Yes", "No"), 1000, replace=TRUE, prob=c(0.4, 0.6)))
var6 <- factor(sample(c("Yes", "No"), 1000, replace=TRUE, prob=c(0.4, 0.6)))

my_data = data.frame( var1, var2, var3, var4, var5, var6)

I want to calculate "grouped percentiles" (e.g. at arbitrary levels) for different columns in this dataset according to the categorical variables.

Initially, I was trying to do this using a function - but this kept giving me a lot of difficulty (e.g. R: Difficulty Calculating Percentiles?).

As a result, I am trying to do this "manually" in the meantime. For instance, suppose:

  • Based on groupings of var5 and var6
  • I want to create a variable "class3" that splits var3 into groups of 10 percentiles
  • And I want to create a variable "class4" that splits var4 into groups of 20 percentiles

As an example, here are two different ways I am trying to do this:

Method 1: Produces some NA's?

library(dplyr)
final = my_data %>% group_by(var5, var6) %>%
  mutate(class3 = case_when(ntile(var3, 10) == 1 ~ paste0(round(min(var3), 2), " to ", round(quantile(var3, 0.1), 2), " decile 1"),
                            ntile(var3, 10) == 2 ~ paste0(round(quantile(var3, 0.1), 2), " to ", round(quantile(var3, 0.2), 2), " decile 2"),
                            ntile(var3, 10) == 3 ~ paste0(round(quantile(var3, 0.2), 2), " to ", round(quantile(var3, 0.3), 2), " decile 3"),
                            ntile(var3, 10) == 4 ~ paste0(round(quantile(var3, 0.3), 2), " to ", round(quantile(var3, 0.4), 2), " decile 4"),
                            ntile(var3, 10) == 5 ~ paste0(round(quantile(var3, 0.4), 2), " to ", round(quantile(var3, 0.5), 2), " decile 5"),
                            ntile(var3, 10) == 6 ~ paste0(round(quantile(var3, 0.5), 2), " to ", round(quantile(var3, 0.6), 2), " decile 6"),
                            ntile(var3, 10) == 7 ~ paste0(round(quantile(var3, 0.6), 2), " to ", round(quantile(var3, 0.7), 2), " decile 7"),
                            ntile(var3, 10) == 8 ~ paste0(round(quantile(var3, 0.7), 2), " to ", round(quantile(var3, 0.8), 2), " decile 8"),
                            ntile(var3, 10) == 9 ~ paste0(round(quantile(var3, 0.8), 2), " to ", round(quantile(var3, 0.9), 2), " decile 9"),
                            ntile(var3, 10) == 10 ~ paste0(round(quantile(var3, 0.9), 2), " to ", round(max(var3), 2), " decile 10"))) %>%

  mutate(class4 = case_when(ntile(var4, 20) == 1 ~ paste0(round(min(var4), 2), " to ", round(quantile(var4, 0.1), 2), " pcile 1"),
                            ntile(var4, 20) == 2 ~ paste0(round(quantile(var4, 0.1), 2), " to ", round(quantile(var4, 0.2), 2), " pcile 2"),
                            ntile(var4, 20) == 3 ~ paste0(round(quantile(var4, 0.2), 2), " to ", round(quantile(var4, 0.3), 2), " pcile 3"),
                            ntile(var4, 20) == 4 ~ paste0(round(quantile(var4, 0.3), 2), " to ", round(quantile(var4, 0.4), 2), " pcile 4"),
                            ntile(var4, 20) == 5 ~ paste0(round(quantile(var4, 0.4), 2), " to ", round(quantile(var4, 0.5), 2), " pcile 5")))

Method 2: Fewer NA's?

final = my_data %>% group_by(var5, var6) %>%  mutate(class3 = paste0(cut(var3, breaks = c(-Inf, quantile(var3, c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)), Inf), 
                                                                 labels = c("ptile 1", "ptile 2", "ptile 3", "ptile 4", "ptile 5", "ptile 6", "ptile 7", "ptile 8", "ptile 9", "ptile 10")),
                                                             " (", round(min(var3), 2), " to ", round(max(var3), 2), ")")) %>% 
    mutate(class4 = paste0(cut(var4, breaks = c(-Inf, quantile(var4, c(0.2, 0.4, 0.6, 0.8)), Inf), 
                               labels = c("ptile 1", "ptile 2", "ptile 3", "ptile 4", "ptile 5")),
                           " (", round(min(var4), 2), " to ", round(max(var4), 2), ")"))

I think that Method 2 might be more correct, seeing as there are fewer NA values being produced - but in the meantime, could someone please help me verify if I am doing this correctly (in Method 2) ... and if not, how can I correct this?

Thanks!

stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 1
    I guess you already got a tidyverse solution , but you should look at the `quantile` function to understand what the base R solution would have been. – IRTFM Jan 01 '23 at 04:54
  • I do not get NA values with either of your methods, so I'm not sure the exact problem. The second method is incorrect because you are doing `min(...)` and `max(...)` calls in your character assignment which means all your `class3` variables wouldn't actually be including the range. I have edited my answer to include a [tag:data.table] solution which works well. – Cole Jan 01 '23 at 13:55
  • Do you want this: `my_data %>% group_by(var5, var6) %>% mutate(class3 = ntile(var3, n = 10), class4 = ntile(var4, n = 20))`? – Quinten Jan 05 '23 at 10:01

2 Answers2

2

We can easily calculate percentiles in R using the quantile() function, which uses the following syntax:

quantile(x, probs = seq(0, 1, 0.25))

x: a numeric vector whose percentiles we wish to find probs: a numeric vector of probabilities in [0,1] that represent the percentiles we wish to find

#create vector of 100 random values uniformly distributed between 0 and 500
data <- runif(100, 0, 500)

#Find the quartiles (25th, 50th, and 75th percentiles) of the vector
quantile(data, probs = c(.25, .5, .75))

#      25%       50%       75% 
# 97.78961 225.07593 356.47943 

#Find the deciles (10th, 20th, 30th, ..., 90th percentiles) of the vector
quantile(data, probs = seq(.1, .9, by = .1))

#      10%       20%       30%       40%       50%       60%       70%       80% 
# 45.92510  87.16659 129.49574 178.27989 225.07593 300.79690 337.84393 386.36108 
#      90% 
#423.28070

#Find the 37th, 53rd, and 87th percentiles
quantile(data, probs = c(.37, .53, .87))

#     37%      53%      87% 
#159.9561 239.8420 418.4787
stats_noob
  • 5,401
  • 4
  • 27
  • 83
1

For Method 1, I believe you mainly have NA values because you need to use ntile(..., 5) instead of ntile(..., 20).

final = my_data %>% group_by(var5, var6) %>%
  mutate(class3 = case_when(ntile(var3, 10) == 1 ~ paste0(round(min(var3), 2), " to ", round(quantile(var3, 0.1), 2), " decile 1"),
                            ntile(var3, 10) == 2 ~ paste0(round(quantile(var3, 0.1), 2), " to ", round(quantile(var3, 0.2), 2), " decile 2"),
                            ntile(var3, 10) == 3 ~ paste0(round(quantile(var3, 0.2), 2), " to ", round(quantile(var3, 0.3), 2), " decile 3"),
                            ntile(var3, 10) == 4 ~ paste0(round(quantile(var3, 0.3), 2), " to ", round(quantile(var3, 0.4), 2), " decile 4"),
                            ntile(var3, 10) == 5 ~ paste0(round(quantile(var3, 0.4), 2), " to ", round(quantile(var3, 0.5), 2), " decile 5"),
                            ntile(var3, 10) == 6 ~ paste0(round(quantile(var3, 0.5), 2), " to ", round(quantile(var3, 0.6), 2), " decile 6"),
                            ntile(var3, 10) == 7 ~ paste0(round(quantile(var3, 0.6), 2), " to ", round(quantile(var3, 0.7), 2), " decile 7"),
                            ntile(var3, 10) == 8 ~ paste0(round(quantile(var3, 0.7), 2), " to ", round(quantile(var3, 0.8), 2), " decile 8"),
                            ntile(var3, 10) == 9 ~ paste0(round(quantile(var3, 0.8), 2), " to ", round(quantile(var3, 0.9), 2), " decile 9"),
                            ntile(var3, 10) == 10 ~ paste0(round(quantile(var3, 0.9), 2), " to ", round(max(var3), 2), " decile 10"))) %>%
  
  
  mutate(class4 = case_when(ntile(var4, 5) == 1 ~ paste0(round(min(var4), 2), " to ", round(quantile(var4, 0.1), 2), " pcile 1"),
                            ntile(var4, 5) == 2 ~ paste0(round(quantile(var4, 0.2), 2), " to ", round(quantile(var4, 0.4), 2), " pcile 2"),
                            ntile(var4, 5) == 3 ~ paste0(round(quantile(var4, 0.4), 2), " to ", round(quantile(var4, 0.6), 2), " pcile 3"),
                            ntile(var4, 5) == 4 ~ paste0(round(quantile(var4, 0.6), 2), " to ", round(quantile(var4, 0.8), 2), " pcile 4"),
                            ntile(var4, 5) == 5 ~ paste0(round(quantile(var4, 0.8), 2), " to ", round(quantile(var4, 1), 2), " pcile 5")))

Edit: If I were approaching this from scratch, I would use this approach:

library(data.table)
dt = as.data.table(my_data)

dt[, c("class3", "class4") := {
      class3_cut = cut(var3, quantile(var3, seq(0, 1, .1)))
      class4_cut = cut(var4, quantile(var4, seq(0, 1, .2)))
      .(paste(class3_cut, as.integer(class3_cut)),
        paste(class4_cut, as.integer(class4_cut)))
      },
   by = .(var5, var6)]
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Why not use quantile? – IRTFM Jan 01 '23 at 04:55
  • @IRTFM instead of `cut`? As with your comment to OP, I tried to not steer too far away, but I did suggest an alternative [tag:data.table] solution with base R functions. – Cole Jan 01 '23 at 05:01
  • @ Cole: Would you say my "METHOD 2" is incorrect? thank you so much! – stats_noob Jan 05 '23 at 00:05
  • @stats_noob Method 2 is incorrect based on the range you provide in the `min(.)` to `max(.)` argument. My understanding is that you want the range of the particular bin as opposed to the range of the entire dataset. You could get rid of the label argument which then provides the ranges correctly. – Cole Jan 05 '23 at 12:22