I am working with the R programming language.
I have the following dataset:
library(dplyr)
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))
my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))
I then have this function, which is used to calculate arbitrary percentiles for a variable:
# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
  # Calculate the percentiles
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  # Create a character vector to store the labels
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
Sometimes, when I use this function like this:
# error not produced on this dataset, but on other datasets
na.omit(my_data) %>%
  group_by(gender, status, country) %>%
  mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
I get one of two errors:
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique
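For reference, the second error can be reproduced on tiny inputs. The sketch below repeats `ptile()` so that it runs on its own, and the two input vectors are hypothetical examples rather than values from my real data. It assumes the cause is repeated quantile values, which happens both when a group has a single observation and when a group contains many tied values. It does not reproduce the "invalid number of intervals" error.

```r
# Same ptile() as above, repeated so this snippet is self-contained
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Case 1 (hypothetical): a "group" with a single observation.
# All 11 quantiles equal 5, so the breaks passed to cut() are duplicated.
err_single <- tryCatch(ptile(5, 10), error = function(e) conditionMessage(e))

# Case 2 (hypothetical): several rows, but heavily tied values.
# The quartiles are 1, 1, 1, 1, 2, so the breaks are duplicated again.
err_ties <- tryCatch(ptile(c(1, 1, 1, 1, 2), 4), error = function(e) conditionMessage(e))

print(err_single)
print(err_ties)
```

So the number of rows in a group is not the only thing that matters: the number of distinct values in the group matters as well.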
At first I thought that these errors were being produced because I am applying this function to groups of rows, and some of these groups might contain too few rows for the desired percentiles to be calculated.
I had originally thought that perhaps I could fix this problem by excluding "groups" with an insufficient number of rows:
na.omit(my_data) %>%
  group_by(gender, status, country) %>%
  # keep only groups with at least 5 rows
  filter(n() >= 5) %>%
  mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
But the same errors still persist.
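To see which groups are likely to cause trouble, the group sizes and the number of distinct values per group can be inspected before calling `ptile()`. This is only a diagnostic sketch: `toy`, `grp`, and `group_check` are hypothetical names, and the toy data is made up so that one group has a single row and another has only tied values.

```r
library(dplyr)

# Small hypothetical data set:
# group "a" is unproblematic, group "b" has only tied values, group "c" has one row
toy <- data.frame(
  grp  = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "c"),
  var1 = c(1.2, 3.4, 5.6, 7.8, 9.0, 2, 2, 2, 2, 4.5)
)

# Count rows and distinct values of var1 within each group
group_check <- toy %>%
  group_by(grp) %>%
  summarise(n_rows = n(), n_unique = n_distinct(var1), .groups = "drop")

print(group_check)
```

Groups where `n_unique` is small relative to the requested number of percentiles are candidates for the "'breaks' are not unique" error, even when `n_rows` looks large enough.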
I was wondering: is there some way to modify this percentile function so that, when percentiles cannot be calculated at the desired level, the next closest level of percentiles is calculated instead?
As an example, if I want percentiles in groups of 10 and this is not possible, perhaps percentiles in groups of 15 or groups of 20 might be possible?
As another example, suppose some group of observations (e.g. male, immigrant, country A) has only 1 observation and I want percentiles in groups of 10. Naturally, this is not possible. Without knowing in advance that such a group exists, is it possible to modify this ptile function so that it either ignores this group or calculates the closest possible percentiles (e.g. places everything into 1 group)?
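To make the desired behaviour concrete, here is a rough sketch of the kind of fallback described above. `ptile_safe` is a hypothetical name, and the rule it uses (try the requested number of bins, then reduce the number of bins one at a time until the breaks are unique, and put everything into a single group when there are fewer than two distinct values) is just one assumption about what "closest possible" could mean. It is not meant as the definitive way to do this.

```r
# Hypothetical fallback version of ptile(): reduces the number of bins
# until the quantile breaks are unique, instead of raising an error.
ptile_safe <- function(x, n_percentiles) {
  one_group <- "all values (1 group)"
  # The number of distinct non-missing values limits how many bins are possible
  n_distinct_vals <- length(unique(x[!is.na(x)]))

  # Fewer than two distinct values: place every non-missing value into one group
  if (n_distinct_vals < 2) {
    return(factor(ifelse(is.na(x), NA_character_, one_group)))
  }

  # Try the requested number of bins first, then progressively fewer.
  # With n = 1 the breaks are min and max, which differ here, so the loop always returns.
  for (n in seq(min(n_percentiles, n_distinct_vals), 1)) {
    pct <- quantile(x, probs = seq(0, 1, length.out = n + 1),
                    na.rm = TRUE, names = FALSE)
    if (anyDuplicated(pct) == 0) {
      labels <- sprintf("%.2f to %.2f percentile %d",
                        head(pct, -1), tail(pct, -1), seq_len(n))
      return(cut(x, breaks = pct, labels = labels, include.lowest = TRUE))
    }
  }
}

# A single observation ends up in one group instead of causing an error
res_single <- ptile_safe(5, 10)

# Heavily tied values fall back to fewer bins than requested
res_ties <- ptile_safe(c(1, 1, 1, 1, 2), 4)

# Well-behaved data still gets the requested number of bins
set.seed(1)
res_full <- ptile_safe(rnorm(100), 10)

print(nlevels(res_single))
print(nlevels(res_ties))
print(nlevels(res_full))
```

Under these assumptions it could replace `ptile()` inside the grouped `mutate()` call, e.g. `mutate(result1 = ptile_safe(var1, 10))`. One drawback of this design is that different groups may silently end up with different numbers of bins.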
In general, how can I change this ptile function so that these errors can be fixed?
Can someone please suggest a way to do this?
Thanks!
Note: I am also open to alternative ways of writing this function or solving this problem.