How to recode a large number of repeated values in dataframe column

Question

I have a dataframe with ~34,000 rows. It has an identifier column containing ~2,300 unique values, each of which is repeated an arbitrary number of times.

I need to recode these values into something shorter. So far, I can only find examples that show how to recode unique values manually, which isn't really practical in this case.

A tibble: 2300 × 1
ID
<dbl>
24650010203
24650010203
24650010203
24650010203
24650010304
24650010405
24650010405
24650010405
24650010405
24650010506
etc...

What's the quickest and easiest way to recode all these values in a way that rows with the same identifier retain their identity? It can be something as simple as an integer range from 0001:2300, though I'd like all IDs to have the same number of digits.

E.g.

24650010203 --> 0001
24650010203 --> 0001
24650010203 --> 0001
24650010203 --> 0001
24650010304 --> 0002
24650010405 --> 0003
24650010405 --> 0003
24650010405 --> 0003
24650010405 --> 0003
24650010506 --> 0004

`as.numeric(as.factor(x))` then use format to prefix with zeros. — zx8754, Jan 25 '22 at 13:16
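A minimal sketch of what this comment suggests, using a few of the IDs from the question. The zero-padding is done with `sprintf` here, since base `format` pads numbers with spaces by default; that substitution and the names `x`, `codes` and `short_id` are assumptions for illustration, not part of the comment.

```r
# A few example IDs taken from the question (stored as doubles)
x <- c(24650010203, 24650010203, 24650010304,
       24650010405, 24650010405, 24650010506)

# Factor levels are the sorted unique values, so the integer codes run 1..n
codes <- as.integer(as.factor(x))

# "%04d" left-pads each code with zeros to a fixed width of 4
short_id <- sprintf("%04d", codes)
short_id
# [1] "0001" "0001" "0002" "0003" "0003" "0004"
```

Because the codes follow the sorted order of the unique values, the result does not depend on the order of the rows.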

score 1 · Answer 1 · answered Jan 25 '22 at 13:21

Besides as.numeric(as.factor(x)), you can use data.table::rleid and format the result with sprintf:

# rleid() numbers consecutive runs of equal values; "%04d" zero-pads to 4 digits
data$ID <- sprintf("%04d", data.table::rleid(data$V1))
data$ID
# [1] "0001" "0001" "0001" "0001" "0002" "0003" "0003" "0003" "0003" "0004"

data

data <- structure(list(V1 = c(24650010203, 24650010203, 24650010203, 
24650010203, 24650010304, 24650010405, 24650010405, 24650010405, 
24650010405, 24650010506)), class = "data.frame", row.names = c(NA, 
-10L))

answered Jan 25 '22 at 13:21

Maël


Be careful with rleid, though: it only gives the desired result when the column is sorted. Otherwise, the same value gets a different ID each time it reappears. – Merijn van Tilborg Jan 25 '22 at 13:41
Right, good point. – Maël Jan 25 '22 at 13:41
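To illustrate the caveat above, here is a hedged sketch of an order-independent alternative in base R, using `match` against `unique`. This is not the answerer's method, and the unsorted vector `x` is made up from IDs in the question purely for illustration.

```r
# Unsorted example: the same IDs reappear after other values
x <- c(24650010203, 24650010304, 24650010203, 24650010405, 24650010304)

# match() gives, for each element, the position of its first occurrence
# in unique(x), so equal values share one code regardless of row order
codes <- match(x, unique(x))

# Zero-pad to a fixed width of 4, as in the answer
short_id <- sprintf("%04d", codes)
short_id
# [1] "0001" "0002" "0001" "0003" "0002"
```

On this input, a run-length ID such as `data.table::rleid(x)` would number the five rows 1 to 5, because no two adjacent values are equal. Sorting the column first, or using `match` or a factor as above, avoids that.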
