-1

Below is the simplified example of my data, the column new is the one I want to add to the dataset. Basically, I want to rename the riskset using 1,2,3,4... so that it'll be easier for me to do loops. The numbers in riskset have been sorted, they are not continuous and there are duplicate numbers in the riskset (10~11 for each unique number), so how to add another column grouping the observations based on their riskset number?

  riskset            new
a.0000000017          1
a.0000000017          1
a.0000000017          1
a.0000000017          1
a.0000000066          2
a.0000000066          2
a.0000000066          2
a.0000000066          2
a.0000000099          3
a.0000000099          3
a.0000000099          3
a.0000000099          3
a.0000000099          3
              ......
Carlos Cavero
  • 3,011
  • 5
  • 21
  • 41

1 Answers1

0

A simple base R solution would be to use factors and then cast as.integer

as.integer(as.factor(df$riskset))
#[1] 1 1 1 1 2 2 2 2 3 3 3 3 3

Or since you've asked for a dplyr option

df %>% mutate(new = as.integer(as.factor(riskset)))

or the base R equivalent using transform

transform(df, new = as.integer(as.factor(riskset)))

Another option is to use dplyr::lag and cumsum

df %>% mutate(new = cumsum(riskset != lag(riskset, default = T)) + 1)

Or since you mentioned group_by you could do

df %>% group_by(riskset) %>% group_indices()

to generate the group indices.

group_indices also works within mutate

df %>% group_by(riskset) %>% mutate(new = group_indices())

Sample data

df <- read.table(text =
    "  riskset            new
a.0000000017          1
a.0000000017          1
a.0000000017          1
a.0000000017          1
a.0000000066          2
a.0000000066          2
a.0000000066          2
a.0000000066          2
a.0000000099          3
a.0000000099          3
a.0000000099          3
a.0000000099          3
a.0000000099          3", header = T)
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68