How do I subset a data.frame into three parts based on the values of one column? I want to show the U shape of a curve by computing means within the different subsets. I have already figured out how to get an arbitrary number of top and bottom values, and how to get the top x and bottom x percent. What I am after is a split such as 25%/50%/25%.

low.x <- top_n(final_data, -100, final_data$variablex)
high.x <- top_n(final_data, 100, final_data$variablex)

Or something like this, which still gives me the wrong output for low.x:

n <- 25
low.x <- subset(final_data, final_data$variablex < quantile(final_data$variablex, prob = 1 - n/100))
high.si <- subset(final_data, final_data$variablex > quantile(final_data$variablex, prob = 1 - n/100))

But how do I build the subsets for the lower 25%, the middle 50% and the top 25%?

Thank you!

  • 1
    `subset(final_data,variablex – user2974951 Dec 12 '18 at 12:51
  • Without having a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), I'll guess that you can use `quantile` to find the break values, then use that as the `breaks` argument to `cut` – camille Dec 12 '18 at 13:01

1 Answer

Create a grouping variable g based on the quantiles quant and then split the data by it. The input need not be sorted.

x <- 1:12 # test data

quant <- quantile(x, c(0, .25, .75, 1))
g <- cut(x, quant, include.lowest = TRUE, labels = c("lo", "mid", "hi"))
split(x, g)

giving:

$`lo`
[1] 1 2 3

$mid
[1] 4 5 6 7 8 9

$hi
[1] 10 11 12
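
The same idea can be carried over to a data frame to get the per-group means the question asks about. The following is only a sketch under assumptions: `final_data` and `variablex` are the names from the question, while the outcome column `y` and the simulated values are made up for illustration.

```r
# Hypothetical data: `y` and the simulated values are assumptions,
# standing in for the asker's real columns.
set.seed(1)
final_data <- data.frame(variablex = runif(200, -1, 1))
final_data$y <- final_data$variablex^2 + rnorm(200, sd = 0.1)

# Break points at the 0%, 25%, 75% and 100% quantiles of variablex.
quant <- quantile(final_data$variablex, c(0, .25, .75, 1))

# Label each row as lower 25%, middle 50% or top 25%.
final_data$g <- cut(final_data$variablex, quant, include.lowest = TRUE,
                    labels = c("lo", "mid", "hi"))

# Split into a list of three data frames and take the mean of y in each.
parts <- split(final_data, final_data$g)
group_means <- sapply(parts, function(d) mean(d$y))
group_means
```

With a U-shaped relationship, the mean for `mid` should come out lower than the means for `lo` and `hi`.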

quantcut

This could alternatively be done more compactly using quantcut from the gtools package. It also handles duplicates in a more sophisticated way.

library(gtools)

g <- quantcut(x, c(0, .25, .75, 1), labels = c("lo", "mid", "hi"))
split(x, g)
G. Grothendieck
  • 1
    I'd use `findInterval` (more efficient) instead of `cut`. – nicola Dec 12 '18 at 13:02
  • Good idea if performance is critical, but `cut` is more flexible in that it can assign labels, and it is already so fast that with x = 1:1000 `system.time` reports the entire processing as taking 0 seconds, so the speed advantage may be negligible. – G. Grothendieck Dec 12 '18 at 13:39
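
For comparison, the `findInterval` variant suggested in the comments might look like the sketch below. It reuses the answer's test data, and the conversion to a labelled factor is an added step of my own, since `findInterval` returns plain interval numbers rather than labels.

```r
x <- 1:12  # same test data as in the answer

quant <- quantile(x, c(0, .25, .75, 1))

# findInterval returns the index of the interval each value falls into;
# rightmost.closed = TRUE keeps the maximum in the last interval.
idx <- findInterval(x, quant, rightmost.closed = TRUE)

# Attach labels by hand, since findInterval does not do this itself.
g <- factor(idx, levels = 1:3, labels = c("lo", "mid", "hi"))
split(x, g)
```

Note that `findInterval` treats intervals as closed on the left by default, whereas `cut` closes them on the right, so values that coincide exactly with an inner break point can land in different groups.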