Say I have a dataset called iris. I want to create an indicator variable called sepal_length_group in this dataset. The values of this indicator will be p25, p50, p75, and p100. For example, I want sepal_length_group to be "p25" for an observation if the Species is "setosa" and the Sepal.Length is less than or equal to the 25th percentile of Sepal.Length among all "setosa" observations. I wrote the following code, but it generates only NAs:

library(dplyr)  # provides %>%, group_by(), mutate() and select()
library(skimr)

sepal_length_distribution <- iris %>% group_by(Species) %>% skim(Sepal.Length) %>% select(3, 9:12)

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2], "p25", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2] &
                                                Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3], "p50", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3] &
                                                        Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4], "p75", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4] &
                                                        Sepal.Length < sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),5], "p100", NA))

Any help will be highly appreciated!

Anup
  • So break into quantiles by group? – camille May 19 '21 at 23:36
  • A few posts that should help: https://stackoverflow.com/q/60291876/5325862, https://stackoverflow.com/q/42948306/5325862 – camille May 19 '21 at 23:44
  • So you specifically want to use the skimr output? When you say an indicator variable do you mean that you basically want an ordered factor? – Elin Jun 15 '21 at 12:04

1 Answer

This can be done simply with the function `cut`, as suggested by @camille in the comments:

library(tidyverse)

iris %>%
  group_by(Species) %>%
  # cut Sepal.Length at each species' own quartiles and label the four bins
  mutate(cat = cut(Sepal.Length,
                   breaks = quantile(Sepal.Length, c(0, .25, .5, .75, 1)),
                   labels = paste0('p', c(25, 50, 75, 100)),
                   include.lowest = TRUE))
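
To see why `include.lowest = TRUE` is there, here is a small sketch on a made-up vector (`x`, `brks` and `labs` are illustrative names, not part of the question). By default `cut` builds intervals that are open on the left, so the smallest value, which equals the first break, would otherwise fall outside every bin and come back as NA.

```r
# Toy data, chosen only to illustrate the behaviour of cut()
x    <- c(1, 2, 3, 4, 5, 6, 7, 8)
brks <- quantile(x, c(0, .25, .5, .75, 1))
labs <- paste0("p", c(25, 50, 75, 100))

# Default: intervals are (a, b], so the minimum is not in any bin
without_lowest <- cut(x, breaks = brks, labels = labs)

# include.lowest = TRUE closes the first interval on the left: [a, b]
with_lowest <- cut(x, breaks = brks, labels = labs, include.lowest = TRUE)

is.na(without_lowest[1])     # the minimum is lost without include.lowest
as.character(with_lowest)    # every value gets a label
```

Note that `cut` stops with an error if the breaks are not unique, which can happen when a group has many tied values; that is not the case for Sepal.Length in iris.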
Onyambu
  • Thanks. This solves this particular problem of mine. But I was thinking of a more general case where I may want to create an indicator variable based on a particular cell in a different data frame. That's why I tried to use `which`. – Anup May 20 '21 at 01:24
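
For the more general case raised in the last comment, where the cut points live in a separate data frame, one possible approach is to join that table onto the data by the grouping column and compare row by row. This is only a sketch: the `thresholds` table and its column names are hypothetical stand-ins for whatever lookup table you have (it could equally be built from `skim` output).

```r
library(dplyr)

# Hypothetical lookup table: one row of cut points per species
thresholds <- iris %>%
  group_by(Species) %>%
  summarise(p25 = quantile(Sepal.Length, .25, names = FALSE),
            p50 = quantile(Sepal.Length, .50, names = FALSE),
            p75 = quantile(Sepal.Length, .75, names = FALSE),
            .groups = "drop")

iris_2 <- iris %>%
  # attach each row's own species thresholds
  left_join(thresholds, by = "Species") %>%
  # conditions are checked in order, so each row gets the first bin it fits
  mutate(sepal_length_group = case_when(
    Sepal.Length <= p25 ~ "p25",
    Sepal.Length <= p50 ~ "p50",
    Sepal.Length <= p75 ~ "p75",
    TRUE                ~ "p100"
  )) %>%
  # drop the helper columns again
  select(-p25, -p50, -p75)
```

The join means every row is compared against the thresholds of its own species, so no `which` lookup is needed. If a single cell really is required, something like `thresholds$p25[thresholds$Species == "setosa"]` returns a plain number, whereas indexing a tibble with `[row, col]` returns another tibble rather than a value.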