
I am using R to try to create a column in my dataframe called df that splits the data into 20 even groups, with the new column group having the corresponding group for each row. An example of my ordered data looks as such:

                preds ground_truth
65378  0.000002975379            0
27082  0.000004721652            0
26890  0.000006613435            1
130498 0.000007634303            0
173319 0.000007834359            0
20039  0.000009482496            0
64722  0.000009482496            0
53924  0.000009482496            0
165543 0.000009482496            0

I have asked a similar question before and there are similar answers; however, the solutions do not work for some reason. The other answers are here:

Splitting a continuous variable into equal sized groups
R divide data into groups

My solution was to use cut as such:

  df$group <- cut(index(df), 20, labels = FALSE)

I expected this to cut the dataframe index into 20 even groups, so that over the 129844 rows there would be about 6492 in each group. However, this only produces a single group and does not split the data at all. Could someone explain why cut is not working here, when it has worked for the other dataframes?
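For reference, here is a minimal base R sketch of what that call is expected to produce. It assumes the row positions are taken with `seq_len(nrow(df))` as a stand-in for zoo's `index()`, and it uses simulated data in place of the real dataframe:

```r
# Hypothetical stand-in for the real data: only the row count matters here
set.seed(1)
n <- 129844
df <- data.frame(preds = sort(runif(n)))

# cut() on the row positions 1..n, split into 20 equal-width intervals
df$group <- cut(seq_len(nrow(df)), 20, labels = FALSE)

# Each of the 20 groups should hold roughly n / 20 rows
table(df$group)
```

Because the row positions are evenly spaced, equal-width intervals contain almost the same number of rows. That would not hold for unevenly spaced values, such as the original row names.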

Any extra information I would be happy to supply,

EDIT: I need the data groupings to be in order with respect to preds e.g. the first group will contain the highest 6492 values, the second the next highest 6492 and so on.

Here is a dput of the first 10 rows:

structure(list(preds = c(0.00000297537922317814, 0.00000472165221855588,
0.0000066134351160987, 0.00000763430272198875, 0.00000783435945631941,
0.00000948249581302744, 0.00000948249581314139, 0.00000948249581314247,
0.00000948249581314704, 0.0000094824958131879), ground_truth = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")),
.Names = c("preds", "ground_truth"), row.names = c("65378", "27082", "26890",
"130498", "173319", "20039", "64722", "53924", "165543", "168952"),
class = "data.frame")
geds133
  • Where does the `index` function come from? – jdobres Sep 30 '20 at 13:31
  • @jdobres from the package zoo – geds133 Sep 30 '20 at 13:34
  • Does `cut(as.numeric(rownames(df)), 20, labels=F)` work for you? – jay.sf Sep 30 '20 at 13:37
  • It would also be helpful if you could share a reproducible example (sample data, code, and output). As currently written, your code should work as intended, so something else must be off. – jdobres Sep 30 '20 at 13:39
  • @jay.sf Unfortunately not. It gives an error saying `'from' must be a finite number`. – geds133 Sep 30 '20 at 13:56
  • @jdobres The problem I often run into with making reproducible examples on Stack Overflow is that my dataframes are far too large to be added in. Do you have a workaround for this? – geds133 Sep 30 '20 at 13:58
  • Ok, we don't have the second sight, please provide a minimal reproducible example, here's our tutorial: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610. If you read carefully you'll see there's no need at all for large data. – jay.sf Sep 30 '20 at 13:58
  • You could provide a sample of the data, as long as that sample reproduces the error. `dput(df[sample(1:nrow(df), 100), ])` – jdobres Sep 30 '20 at 14:00
  • @jdobres I have included a dput of the first 10 rows above. – geds133 Sep 30 '20 at 14:36
  • @jay.sf I have included a dput of the first 10 rows above. – geds133 Sep 30 '20 at 14:36
  • Running your code with the sample data you've provided does not reproduce your error. – jdobres Sep 30 '20 at 14:38
  • @geds133 `cut(as.numeric(rownames(dat)), 20, labels=F)` works fine for me. – jay.sf Sep 30 '20 at 14:39
  • @jay.sf Quite right, it does work. I must have missed something, so apologies. I have added an edit to the question with a requirement I should have included at the start. Is there a way to make these groups ordered in relation to preds? – geds133 Sep 30 '20 at 14:54

2 Answers


How about just using some modular math?

If we had a data frame with 129844 rows:

df <- data.frame(a = runif(129844))

We can get each row assigned to one of 20 evenly-sized groups labelled 1 to 20 like this:

df$group <- factor(1 + (seq(nrow(df)) - 1) %/% (nrow(df) / 20))

And to prove it:

table(df$group)

#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
#> 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492 6493 6492 6492 6492 6492

Obviously 129844 is not evenly divisible by 20, so we have 4 groups that contain 6493 members.
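The edit to the question asks for groups ordered by preds with the highest values first. A minimal sketch of one way to combine that with the integer-division idea, assuming simulated data and a single `preds` column (the sort step and the multiply-before-divide form are illustrative additions, not part of the code above):

```r
# Simulated stand-in for the real data (129844 rows, one preds column)
set.seed(42)
n <- 129844
df <- data.frame(preds = runif(n))

# Sort so that the highest predictions come first
df <- df[order(df$preds, decreasing = TRUE), , drop = FALSE]

# Multiply before the integer division so the arithmetic stays in whole numbers
df$group <- ((seq_len(n) - 1) * 20) %/% n + 1

# Group sizes: 6492 or 6493 rows each
table(df$group)
```

Multiplying by 20 before dividing by `n` avoids dividing by the non-integer `nrow(df) / 20`, so group boundaries do not depend on floating-point rounding.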

Allan Cameron

For equal-sized and ordered groups we can use ntile from the dplyr package:

library(dplyr)

df <- df %>%
  arrange(preds) %>%
  mutate(group = ntile(preds, 20))

              preds ground_truth group
65378  2.975379e-06            0     1
27082  4.721652e-06            0     2
26890  6.613435e-06            0     3
130498 7.634303e-06            0     4
173319 7.834359e-06            0     5
20039  9.482496e-06            0     6
64722  9.482496e-06            0     7
53924  9.482496e-06            0     8
165543 9.482496e-06            0     9
168952 9.482496e-06            0    10

As your sample only consists of 10 rows, there are just 10 groups. It should work for your whole data frame. Or see cut_number from the ggplot2 package:

library(ggplot2)
df$group2 <- cut_number(df$preds, 20, labels = 1:20)
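
Note that `ntile(preds, 20)` ranks in ascending order, so group 1 holds the lowest values, whereas the question asks for the highest values first. In dplyr this can probably be handled by wrapping the column in `desc()`. The same idea can be sketched in base R with `rank()`, using a small made-up vector with tied values; the names `n_groups`, `r` and `group` are illustrative only, and this is not dplyr's own implementation:

```r
# Small made-up example with tied values, standing in for preds
preds <- c(0.9, 0.1, 0.5, 0.5, 0.5, 0.3, 0.7, 0.2)
n_groups <- 4

# Rank from highest to lowest; ties.method = "first" breaks ties by row order
r <- rank(-preds, ties.method = "first")

# Map ranks 1..n onto groups 1..n_groups of (near-)equal size
group <- ceiling(r * n_groups / length(preds))

data.frame(preds, group)
```

Because ties are broken by row order, tied values can land in neighbouring groups, which keeps the group sizes equal without reordering the rows.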
Apl4n1