I looked through other questions, but they were either too specific or confusingly worded, so I couldn't find an answer that applies to my situation. Here goes:

I have, say, two variables, and 100 observations of each:

V1 <- rnorm(100, 0, 1) 
V2 <- rpois(100, 4) 
data <- cbind(V1, V2)

I want to group the participants based on what quantile they fall into on one variable, say V1, and then compute a mean and standard deviation of V2 for each quantile group.

Key note: I want to create the groups based on how many standard deviations they are from the mean of V1. So my quantile groups should be roughly: bottom 2%, 2nd to 16th percentile, 16th to 50th, 50th to 84th, 84th to 98th, and top 2%.

Dij
  • It's been a while since I've used this, so I'm not sure exactly how it works, but you can pass the output of `quantile` to the `breaks` argument of `cut` to create an additional column labelling each observation by the quantile group it falls in. I had a go with the following; it's not right, but the approach may be `data %>% mutate(quant = cut(V1, breaks = quantile(V1, probs = c(0.02, 0.16, 0.5, 0.84, 0.98))))`. This question deals with something similar: https://stackoverflow.com/questions/4126326/how-to-quickly-form-groups-quartiles-deciles-etc-by-ordering-columns-in-a – NColl Dec 18 '18 at 00:46
  • Thanks NColl, this helped a bit but manually creating the quantiles with the prob vector ended up giving me some NAs for some reason... this was very helpful though! – Dij Dec 18 '18 at 02:33

1 Answer

Instead of calculating quantiles, you can standardize the data (convert V1 to z-scores) and use integers as the cut-points for the categories.

We add a scaled column:

data <- data.frame(data, V3 = scale(V1))
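As a quick check of what `scale` is doing here, the sketch below compares it with a z-score computed by hand. The seed and the names `z_manual` and `z_scale` are my own, added only for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
V1 <- rnorm(100, 0, 1)

# Standardize by hand: subtract the sample mean, divide by the sample SD
z_manual <- (V1 - mean(V1)) / sd(V1)

# scale() returns a one-column matrix; as.numeric() turns it into a plain vector
z_scale <- as.numeric(scale(V1))

# Compare the two versions up to floating-point tolerance
all.equal(z_manual, z_scale)
```

So each value in the scaled column is simply the number of sample standard deviations that observation lies from the mean of V1.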

Then split the data into categories with cut-points -3 to 3:

data$cats <- cut(data$V3, -3:3, labels = letters[1:6])
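One thing to be aware of: `cut` returns `NA` for any value outside the outermost breaks, so an observation more than 3 standard deviations from the mean would be dropped from the groups. If you want the end groups to be "everything below -2" and "everything above 2", a possible variant is to use open-ended breaks. This is a sketch rather than a drop-in replacement; the seed and the name `dat` are assumptions of mine.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
V1 <- rnorm(100, 0, 1)
V2 <- rpois(100, 4)
dat <- data.frame(V1, V2, V3 = as.numeric(scale(V1)))

# Breaks at -2..2 SDs, open-ended at both extremes so no observation is dropped
dat$cats <- cut(dat$V3, breaks = c(-Inf, -2:2, Inf), labels = letters[1:6])

# Count observations per group, showing NAs if there were any
table(dat$cats, useNA = "ifany")
```

Seven break points give six intervals, so the six labels still line up with the six groups in the question.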

Finally we aggregate to get the mean and standard deviation of V2 for each group.

aggregate(V2 ~ cats, function(x) c(mean = mean(x), st.dev = sd(x)), data = data)

#  cats  V2.mean V2.st.dev
#1    a 4.666667  2.081666
#2    b 4.352941  2.343640
#3    c 4.030303  1.828333
#4    d 3.838710  1.714580
#5    e 4.000000  3.082207
#6    f 5.000000  2.645751
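If you would still rather cut on empirical quantiles, as tried in the comments, the NAs most likely came from the breaks not covering the whole range: with probabilities running only from 0.02 to 0.98, anything below the 2nd or above the 98th percentile falls outside every interval. A rough sketch of one way around that is below; the seed and the names `dat`, `probs` and `grp` are mine, and `pnorm` is used only to turn the standard-deviation cut-points into percentiles.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
V1 <- rnorm(100, 0, 1)
V2 <- rpois(100, 4)
dat <- data.frame(V1, V2)

# Percentiles matching -2, -1, 0, 1, 2 SDs under a normal distribution,
# padded with 0 and 1 so the minimum and maximum are covered
probs <- c(0, pnorm(-2:2), 1)

# include.lowest = TRUE keeps the minimum, which would otherwise be excluded
dat$grp <- cut(dat$V1, breaks = quantile(dat$V1, probs = probs),
               include.lowest = TRUE, labels = letters[1:6])

# Mean and standard deviation of V2 within each quantile group
aggregate(V2 ~ grp, data = dat,
          FUN = function(x) c(mean = mean(x), st.dev = sd(x)))
```

Note that the two approaches group slightly differently: this one fixes the share of observations in each group, while cutting the scaled values fixes the distance from the mean.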
Joe