I already had a look here, where the cut function is used. However, I haven't been able to come up with a clever solution given my situation.

First some example data that I currently have:

df <- data.frame(
  Category = LETTERS[1:20], 
  Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90)
)

I would like to add a third column that defines a new category based on the Nber_within_category column. In this example, how can I create e.g. Category_new such that the Nber_within_category values in each new category sum to at least 5, with the constraint that if a Category already has Nber_within_category >= 5, the original category is kept?

So for example, it should look like this:

df <- data.frame(
  Category = LETTERS[1:20], 
  Nber_within_category = c(rep(1,8), rep(2,3), rep(6,2), rep(10,3), 30, 50, 77, 90),
  Category_new = c(rep('a',5), rep('b', 4), rep('c',2), LETTERS[12:20])
)
Leonardo
Anonymous
  • Keeping the original category when `Nber_within_category >= 5` is clear enough, but what was the logic behind the choice of categories for the other rows? I mean, why would the first 5 rows - for example - have the same "new" category? – DS_UNI Feb 04 '19 at 14:35
  • That's allowed to be random, as long as their sum (of `Nber_within_category`) is +/- 5. I tried doing it with a `cumsum`, but didn't get to an answer that way. So whether you take the first 5 as a new category, or the 1st, 3rd, 5th, 7th, and 9th, does not matter to me. – Anonymous Feb 04 '19 at 14:54

1 Answer

It's a bit of a hack, but it works:

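library(dplyr)

# as.character() guards against Category being a factor (the data.frame()
# default in R < 4.0), where ifelse() would return integer codes, not labels
df %>% 
  mutate(tmp = floor((cumsum(Nber_within_category) - 1)/5)) %>% 
  mutate(new_category = ifelse(Nber_within_category >= 5,
                               as.character(Category),
                               letters[tmp+1]))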

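The expression floor((cumsum(Nber_within_category) - 1)/5) sorts the running total into bins of width 5 (the -1 keeps a row whose running total is exactly 5, 10, ... in the lower bin). I use the bin number as an index into letters to get new categories for the rows where Nber_within_category < 5. Note that the cumsum runs over all rows, so a large category sitting between small ones would shift the bins; in the example data the small categories come first.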

It might be easier to understand how the column tmp is defined if you run:

x <- 1:100
data.frame(x, y = floor((x - 1)/5))
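
As a variation on the same binning idea, here is a minimal base R sketch that takes the running total over the small rows only, so the result should not depend on where the large categories sit. The names `threshold`, `small`, `running` and `bin` are my own and not part of the question, and the sketch assumes there are at most 26 merged groups because it indexes into `letters`.

```r
# Example data from the question; stringsAsFactors = FALSE keeps Category as character
df <- data.frame(
  Category = LETTERS[1:20],
  Nber_within_category = c(rep(1, 8), rep(2, 3), rep(6, 2), rep(10, 3), 30, 50, 77, 90),
  stringsAsFactors = FALSE
)

threshold <- 5                                  # minimum size, taken from the question
small <- df$Nber_within_category < threshold    # rows that need to be merged

# Running total over the small rows only, so large rows cannot shift the bins
running <- cumsum(df$Nber_within_category[small])
bin <- floor((running - 1) / threshold)         # 0, 1, 2, ... per block of `threshold`

df$Category_new <- df$Category                  # default: keep the original label
df$Category_new[small] <- letters[bin + 1]      # assumption: at most 26 merged groups

# Total size of each merged group
tapply(df$Nber_within_category[small], df$Category_new[small], sum)
```

On the example data this reproduces the Category_new column from the question. Like the dplyr version, it only gets the group totals close to 5: the last group can fall short, and a value that straddles a bin boundary can leave the previous group below the threshold.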
DS_UNI