I have a dataframe with ~34,000 rows. It has an identifier column containing ~2,300 unique values, each of which is repeated an arbitrary number of times.

I need to recode these values into something shorter. So far, I can only find examples that show how to recode unique values manually, which isn't really practical in this case.

A tibble: 2300 × 1
ID
<dbl>
24650010203
24650010203
24650010203
24650010203
24650010304
24650010405
24650010405
24650010405
24650010405
24650010506
etc...

What's the quickest and easiest way to recode all these values in a way that rows with the same identifier retain their identity? It can be something as simple as an integer range from 0001:2300, though I'd like all IDs to have the same number of digits.

E.g.

24650010203 --> 0001
24650010203 --> 0001
24650010203 --> 0001
24650010203 --> 0001
24650010304 --> 0002
24650010405 --> 0003
24650010405 --> 0003
24650010405 --> 0003
24650010405 --> 0003
24650010506 --> 0004
Pål Bjartan

1 Answer

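Besides as.numeric(as.factor(x)), you can use data.table::rleid and format the result with sprintf. Note that rleid numbers consecutive runs of equal values, so this assumes rows with the same ID are adjacent (as in the example); sort by ID first if they are not:

data$ID <- sprintf("%04d", data.table::rleid(data$V1))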
# [1] "0001" "0001" "0001" "0001" "0002" "0003" "0003" "0003" "0003" "0004"
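
If rows sharing an ID are not guaranteed to be adjacent, a run-based numbering such as rleid would start a new code each time an ID reappears. A minimal base R sketch that does not depend on row order is shown here; the vector name ids and the reordered sample values are illustrative assumptions, not taken from the question:

```r
# Illustrative IDs; the first ID reappears in the last row (non-adjacent).
ids <- c(24650010203, 24650010203, 24650010304, 24650010405, 24650010203)

# match() against unique() numbers IDs by order of first appearance,
# so a repeated ID gets the same code wherever it occurs.
by_first_seen <- sprintf("%04d", match(ids, unique(ids)))

# as.factor() numbers IDs by the sorted order of the unique values instead.
by_sorted <- sprintf("%04d", as.integer(as.factor(ids)))

print(by_first_seen)
print(by_sorted)
```

Both variants zero-pad to four digits, which is enough for ~2,300 unique IDs; a wider format such as "%05d" would be needed for 10,000 or more.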

data

data <- structure(list(V1 = c(24650010203, 24650010203, 24650010203, 
24650010203, 24650010304, 24650010405, 24650010405, 24650010405, 
24650010405, 24650010506)), class = "data.frame", row.names = c(NA, 
-10L))
Maël