R: Difficulty Calculating Percentiles?

Question

I am working with the R programming language.

I have the following dataset:

library(dplyr)

set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))

my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))

I then have this function used to calculate arbitrary percentiles for variables :

# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

When I use this function sometimes:

# error not produced on this dataset, but on other datasets

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

I get one of two errors:

Error in cut.default(x, breaks = pct, lables = labels, include.lowest = TRUE): invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique

At first I thought that these errors are being produced because I am using this function on "groups of rows" - and in some of these rows, there might be too few rows for the desired percentile to be calculated?

I had originally thought that perhaps I could fix this problem by excluding "groups" with an insufficient number of rows:

na.omit(my_data) %>%
    group_by(gender, status, country) %>%
filter(n() < 5) %>%
    mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

But the same errors still persist.

I was wondering - is there some way to modify this percentile function such that when percentiles might not be able to be calculated at the desired level, the next closest level of percentiles can be calculated?

As an example, if I want percentiles at groups of 10 and this is not possible - perhaps percentiles at groups of 15 or groups of 20 might be possible?

As another example, suppose some group of observations (e.g. male, immigrant, country A) only has 1 observation and I want percentiles in groups of 10 - naturally, it seems that this is not possible. Without knowing that such a group exists in advance, is it possible to modify this ptile function such that it either ignores this group or just calculates the closest possible percentile (e.g. places everything into 1)?

In general, how can I change this ptile function so that these errors can be fixed?

Can someone please suggest a way to do this?

Thanks!

Note: I am also open to alternate ways to writing a function/solving this problem

@ TarJae: thank you so much for your reply! I had previously consulted this post and tried to apply the "unique" function to my own code in an attempt to resolve this problem... but I don't think I did it correctly — stats_noob, Dec 30 '22 at 08:36
I think i encountered something new it is large therefore I will post it in the answer section . but it is not really an answer . — TarJae, Dec 30 '22 at 08:37

score 2 · Answer 1 · answered Dec 30 '22 at 08:38

This is not really an answer, but I am quite sure it will help.

First I thought this is the reason:

The reason is that in var 1 and var 2 length of break and label is not equal:

pct_var1 <- quantile(my_data$var1, probs = seq(0, 1, 1/10))
length(pct_var1)
# Create a character vector to store the labels
labels_var1 <- sprintf("%.2f to %.2f percentile %d",
                  head(pct, -1), tail(pct, -1), seq_len(10))
length(labels_var1)
cut_var1 <- cut(my_data$var1, breaks = pct, labels = labels, include.lowest = TRUE)


pct_var2 <- quantile(my_data$var2, probs = seq(0, 1, 1/5))
length(pct_var2)
# Create a character vector to store the labels
labels_var2 <- sprintf("%.2f to %.2f percentile %d",
                  head(pct, -1), tail(pct, -1), seq_len(5))
length(labels_var2)
cut_var2 <- cut(my_data$var2, breaks = pct, labels = labels, include.lowest = TRUE)

> length(pct_var1)
[1] 11
> length(labels_var1)
[1] 10
> length(pct_var2)
[1] 6               
> length(labels_var2)
[1] 5

In combination with this post Cut and labels/breaks length conflict it should be manageable to solve.

But then I encountered that your second data frame example:

na.omit(my_data) %>%
  group_by(gender, status, country) %>%
  filter(n() < 5)

removes all data

thank you for your answer! I will check this out in the morning! — stats_noob, Dec 30 '22 at 09:36
@ TarJae: I attempted to try and solve my own problem using some "heavy-handed" approaches ... can you please take a look at them if you have time and let me know what you think? Thank you so much! — stats_noob, Dec 31 '22 at 01:53

score 2 · Accepted Answer · answered Dec 31 '22 at 05:04

If we need to bypass some of these errors i.e. when the number of elements in the group are less, an option is to use tryCatch or purrr::possibly

library(dplyr)
library(purrr)
f_ptile <- possibly(ptile, otherwise = factor(NA_character_))

-testing

na.omit(my_data) %>%
     group_by(gender, status, country) %>%
     slice(1) %>% # this fails with original ptile function
     mutate(result1 = f_ptile(var1, 10), result2 = f_ptile(var2, 5))
# A tibble: 16 × 8
# Groups:   gender, status, country [16]
   gender status    disease country    var1   var2 result1 result2
   <fct>  <fct>     <fct>   <fct>     <dbl>  <dbl> <fct>   <fct>  
 1 Female Citizen   No      A       11902.  -2436. <NA>    <NA>   
 2 Female Citizen   Yes     B        3508.   6851. <NA>    <NA>   
 3 Female Citizen   Yes     C       16854.  -1769. <NA>    <NA>   
 4 Female Citizen   No      D       12372.   4363. <NA>    <NA>   
 5 Female Immigrant Yes     A        9635.    695. <NA>    <NA>   
 6 Female Immigrant No      B        3120.  -2674. <NA>    <NA>   
 7 Female Immigrant No      C       -1635.   3971. <NA>    <NA>   
 8 Female Immigrant No      D       -3163.   5802. <NA>    <NA>   
 9 Male   Citizen   No      A        3836.   8196. <NA>    <NA>   
10 Male   Citizen   No      B        6125.   8096. <NA>    <NA>   
11 Male   Citizen   Yes     C        2159.   9863. <NA>    <NA>   
12 Male   Citizen   No      D        3622.   8159. <NA>    <NA>   
13 Male   Immigrant No      A       -3003.   6874. <NA>    <NA>   
14 Male   Immigrant Yes     B         -27.1  4063. <NA>    <NA>   
15 Male   Immigrant No      C        4166.   2103. <NA>    <NA>   
16 Male   Immigrant No      D        5718.  17866. <NA>    <NA>

@ akrun: thank you so much for your answer! Why did you use the "slice()" function here? — stats_noob, Dec 31 '22 at 05:32
I have never heard of the "possibly" function before - is there any reason that you chose this function compared to "tryCatch"? — stats_noob, Dec 31 '22 at 05:33
@stats_noob It is just more compact to wrap existing function and create a new function (if you check the possibly code). The `slice` was just to show an edge case where your `ptile` will show an error. You can remove the `slice` step — akrun, Dec 31 '22 at 18:53
@ akrun: thank you so much for your reply! I will now start using "possibly()" instead of "tryCatch()". Earlier, I had asked a related question here - can you please take a look at it if you have time? Thank you so much! https://stackoverflow.com/questions/74925722/r-creating-a-function-to-identify-arbitrary-percentiles — stats_noob, Dec 31 '22 at 20:55

score 1 · Answer 3 · answered Dec 31 '22 at 01:48

Various attempts to solve this problem on my own:

Attempt 1:

final_answer = my_data %>% group_by(gender, status, country) %>% 
  mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
                            ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
                            ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
                            ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
                            ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
                            ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
                            ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
                            ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
                            ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
                            ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>% 
  mutate(result2 = case_when(ntile(var2, 10) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.1), 2), " group 1"),
                            ntile(var2, 10) == 2 ~ paste0(round(quantile(var2, 0.1), 2), " to ", round(quantile(var2, 0.2), 2), " group 2"),
                            ntile(var2, 10) == 3 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.3), 2), " group 3"),
                            ntile(var2, 10) == 4 ~ paste0(round(quantile(var2, 0.3), 2), " to ", round(quantile(var2, 0.4), 2), " group 4"),
                            ntile(var2, 10) == 5 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.5), 2), " group 5"))

Attempt 2:

percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentiles
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))

  # Create a character vector to store the labels
  labels <- character(length(x))

  # Loop through each percentile and assign the corresponding label to each element in the vector
  for (i in 1:length(percentiles)) {
    lower <- percentiles[i]
    upper <- ifelse(i == length(percentiles), max(x), percentiles[i+1])
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    labels[x >= lower & x < upper] <- label
  }

  # Return the labels
  return(labels)
}

final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10))  %>% mutate(result2 = percentile_classifier(var2, 5))

I think @akrun has teached us once again how to do. :-) – TarJae Dec 31 '22 at 06:05 — TarJae, Dec 31 '22 at 06:05

R: Difficulty Calculating Percentiles?

3 Answers3

Linked