Let's see if I can explain this clearly. Say I have a vector mtcars$mpg. If you run hist(mtcars$mpg), you see that there are 6 values between 10 and 15, 12 between 15 and 20, and so on.

What I'm trying to do is find the values of mtcars$mpg that I can later use to separate the data into groups, where each group contains the same number of observations.
For instance, maybe the cutoffs 10, 16 and 22 would give 8 observations between 10 and 16 and another 8 between 16 and 22.

(I looked on SO but can't find any questions/answers that address this)

Anthony D
  • Check `ntile` from `dplyr` https://www.rdocumentation.org/packages/BurStMisc/versions/1.1/topics/ntile – AntoniosK Jun 24 '19 at 12:48
  • Try `cut2` from the Hmisc package. cut2(x, g = 10) will cut the vector x into 10 equal sized groups. Also see https://stackoverflow.com/questions/6104836/splitting-a-continuous-variable-into-equal-sized-groups – Tony Ladson Jun 24 '19 at 13:05

1 Answer

Since mpg is a continuous variable, you can group the data by sorting the data frame by mpg and then adding a grouping variable with rep(x, each = n). For example, using base R with n <- 8 for groups of 8 (this assumes nrow(df) is a multiple of n, as it is for the 32 rows of mtcars):

n <- 8                                            # observations per group
df <- mtcars[order(mtcars$mpg), ]                 # sort rows by mpg
df$group <- rep(seq_len(nrow(df) / n), each = n)  # 1,1,...,2,2,... in blocks of n

The following returns the first (smallest) observation of each group, which is that group's cutoff, and joins it to the sorted data frame as column x:

cutoffs <- aggregate(df$mpg, list(group = df$group), `[`, 1)
merge(df, cutoffs, by = "group")

#### OUTPUT ####

   group  mpg cyl  disp  hp drat    wt  qsec vs am gear carb    x
1      1 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 10.4
2      1 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 10.4
3      1 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 10.4
4      1 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 10.4
5      1 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 10.4
6      1 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 10.4
7      1 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 10.4
8      1 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 10.4
9      2 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 15.5
10     2 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 15.5
11     ...
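
If only the break points are needed, a quantile-based variant is one possible alternative to the sort-and-rep approach above. This is a minimal sketch, not part of the original answer; the names mpg, breaks and group are chosen here for illustration, and four groups are assumed:

```r
# Quartile break points of mpg: 4 groups with roughly equal counts
mpg <- mtcars$mpg
breaks <- quantile(mpg, probs = seq(0, 1, by = 0.25), names = FALSE)

# Assign each value to its interval; include.lowest keeps the minimum in the first group
group <- cut(mpg, breaks = breaks, include.lowest = TRUE)

breaks        # the cutoff values
table(group)  # number of observations per interval
```

Because mpg contains tied values (19.2 and 22.8 each appear twice), the intervals cannot all hold exactly 8 observations; the counts can differ by one or so.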

If you are comfortable with dplyr, you can use ntile, left_join, and summarise:

library(dplyr)

mutate(mtcars, group = ntile(mpg, 4)) %>% 
    group_by(group) %>% 
    left_join(summarise(., cutoff = first(mpg, order_by = mpg)), by = "group") %>% 
    arrange(mpg)

#### OUTPUT ####

# A tibble: 32 x 13
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb group cutoff
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>
 1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4     1   10.4
 2  10.4     8  460    215  3     5.42  17.8     0     0     3     4     1   10.4
 3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4     1   10.4
 4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     1   10.4
 5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4     1   10.4
 6  15       8  301    335  3.54  3.57  14.6     0     1     5     8     1   10.4
 7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3     1   10.4
 8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2     1   10.4
 9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2     2   15.5
10  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4     2   15.5
# … with 22 more rows
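
To see just the boundaries of each group rather than the full joined table, one option is to summarise the smallest and largest mpg per group. This is a sketch under the same ntile(mpg, 4) assumption; the names bounds, n, lower and upper are illustrative and not from the answer above:

```r
library(dplyr)

# One row per group: size plus smallest and largest mpg
bounds <- mtcars %>%
  mutate(group = ntile(mpg, 4)) %>%
  group_by(group) %>%
  summarise(n = n(), lower = min(mpg), upper = max(mpg))

bounds
```

Where a group's upper value equals the next group's lower value, a tie straddles the boundary, so that value alone cannot separate the two groups.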
  • This is going in the right direction but not quite what I'm trying to achieve: find the cutoff values between each group. In the example above, it seems they are 15.2, 19.2, ... (The idea is that I can then create a column where rows with `mtcars$mpg` < 15.2 will have the text value "<15.2", and so on.) Of course, `mtcars$mpg` is a bad example because, for instance, between groups 2 and 3 the cutoff value, 19.2, is shared across both groups. Yikes! – Anthony D Jun 24 '19 at 13:46
  • @TonyD okay, I've edited my answer so the first values of each group are extracted. All you might need to do is convert your `cutoff` variable into a factor, add some text, etc. –  Jun 24 '19 at 14:22
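
The labelling step discussed in these comments could look roughly like the following. It is a sketch only: it uses each group's largest value as the label (an assumption, since the answer extracts the first value instead), and the names upper and label are hypothetical:

```r
n <- 8                                            # observations per group
df <- mtcars[order(mtcars$mpg), ]                 # sort rows by mpg
df$group <- rep(seq_len(nrow(df) / n), each = n)  # consecutive blocks of n rows

# Largest mpg in each group, used here as the group's upper cutoff
upper <- tapply(df$mpg, df$group, max)

# Text label per row, stored as a factor ordered by cutoff
df$label <- factor(paste0("<=", upper[df$group]),
                   levels = paste0("<=", upper))

table(df$label)
```

The groups are assigned by position after sorting, so a tied value such as 19.2 can still appear under two different labels.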