Is there an existing function in R that sorts a continuous variable into an EQUAL number of observations per group?

Question

I have a 2319-row data frame df. I would like to sort the continuous variable var and split it into a specified number of groups with an equal (or as close as possible) number of observations per group. I have seen a similar post where cut2() from Hmisc was recommended, but it does not always give an equal number of observations per group. For example, this is what I get using cut2():

library(Hmisc)
df$Group <- as.numeric(cut2(df$var, g = 10))

var Group
1415 1
1004 1
1285 1
2099 2
2119 2
2427 4
...

table(df$Group)
  1   2   3   4   5   6   7   8   9  10 
232 232 241 223 233 246 219 243 226 224

Has anyone used or written something that relies not on the underlying distribution of the variable (e.g. var), but only on the number of observations in the data and the number of groups specified? Note that var contains non-unique values.

What I want is a more equal number of observations, for example:

table(df$Group)
  1   2   3   4   5   6   7   8   9  10 
232 232 231 233 231 233 232 231 231 233

if you gave a [mcve] we could more easily test our answers and make sure they were what you wanted ... — Ben Bolker, Mar 22 '21 at 02:28
Also, could you please rephrase your question as "how do I ... ?" rather than "is there an existing function that ... ?" It would be more on-topic for SO. — Ben Bolker, Mar 22 '21 at 02:34
Possible duplicate: https://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups — MrFlick, Mar 22 '21 at 05:01
This is not a duplicate question as the solution in that post was cut2() which does not result in an even number of obs per group (see my example above) — Aliv25, Mar 22 '21 at 10:43

Ronak Shah · Accepted Answer · 2021-03-22T13:18:47.330


cut, cut2 and similar functions depend on the distribution of the data to create groups. If you want a more or less equal number of observations per group, one option is to use rep.

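library(dplyr)

n <- 10  # number of groups

df <- df %>%
  arrange(var) %>%  # sort so that group labels follow the order of var
  # rep() truncates each = n()/n to a whole number; length.out = n() pads the
  # remainder by recycling from 1, and cummax() turns those wrapped labels into n.
  mutate(Group = cummax(rep(seq_len(n), each = n()/n, length.out = n())))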
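To see why both `length.out` and `cummax` are needed, here is a minimal base R sketch on a toy case. The 11 rows and 3 groups are made-up values for illustration, not taken from the question.

```r
n_groups <- 3
n_rows <- 11  # hypothetical row count, not divisible by n_groups

# floor(11 / 3) = 3 labels per group gives 9 labels in one pass;
# length.out = 11 recycles the pattern, so the last 2 labels wrap back to 1.
wrapped <- rep(seq_len(n_groups), each = floor(n_rows / n_groups),
               length.out = n_rows)
print(wrapped)  # 1 1 1 2 2 2 3 3 3 1 1

# cummax() keeps the running maximum, so the wrapped 1s become 3.
fixed <- cummax(wrapped)
print(fixed)    # 1 1 1 2 2 2 3 3 3 3 3
```

By the same arithmetic, with 2319 rows and 10 groups, groups 1 to 9 get 231 rows each and group 10 gets 240, so all leftover rows land in the last group.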

edited Mar 22 '21 at 13:18

answered Mar 22 '21 at 03:24

Ronak Shah


does this work properly with 2319 observations and 10 groups? (It might, but I wouldn't be surprised if you had to do something clever to make sure the rounding came out OK) – Ben Bolker Mar 22 '21 at 04:01
It does work ok with any number of rows due to `length.out = n()`. The only issue could be that the last few values would repeat group = 1 again. – Ronak Shah Mar 22 '21 at 04:08
This is closer, though like you said the last few values (the largest values of var) get put back into group 1 with the smallest values so it will not work. It would work if those last values were placed in group 10. – Aliv25 Mar 22 '21 at 12:47
@Aliv25 Can you try the updated answer with `cummax` so that the largest value stays in the last group. – Ronak Shah Mar 22 '21 at 13:19
Worked! Thanks @RonakShah – Aliv25 Mar 22 '21 at 13:46
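
If the group sizes should differ by at most one, as in the desired table in the question, a rank-based split is one possible alternative. This is only a sketch: `equal_groups` is a hypothetical helper name, not an existing function, and it assumes that tied values may be split across neighbouring groups (ties are broken by position). `dplyr::ntile()` may also be worth a look for the same purpose.

```r
# Hypothetical helper: assign each value of x to one of n_groups groups whose
# sizes differ by at most one, based only on rank and the number of values.
equal_groups <- function(x, n_groups) {
  # ties.method = "first" gives every observation a distinct rank, so
  # duplicated values can fall on either side of a group boundary.
  r <- rank(x, ties.method = "first", na.last = "keep")
  # Scale ranks 1..N onto (0, n_groups] and round up to get labels 1..n_groups.
  ceiling(r * n_groups / sum(!is.na(x)))
}

# Example with simulated data of the same size as in the question.
set.seed(1)
x <- sample(1:500, 2319, replace = TRUE)  # made-up values with duplicates
g <- equal_groups(x, 10)
print(table(g))
```

Unlike the `rep()` approach, this does not require sorting the data frame first, since the labels come from the ranks.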
