2

How to recode a numeric variable with many values including missing values, to have numbers 0:n-1 where n is the number of unique values including NA, in a tidy way?

Example:

df <- tibble(x = c(1000, 1000, NA, 1001, 1002, 1003, NA, 1003))

Required output (though it can be any recoding scheme as long as the values are 0:n-1):

# A tibble: 8 x 2
      x     y
  <dbl> <dbl>
1  1000     0
2  1000     0
3    NA     4
4  1001     1
5  1002     2
6  1003     3
7    NA     4
8  1003     3

I am able to do this in a tidy way by first converting x to a factor then using fct_recode() with a named list (created automatically because remember there are many values in x), then back to numeric:

df <- df %>% mutate(x_fct = factor(case_when(
    is.na(x) ~ "level_na",
    TRUE ~ str_c("level_", x)
  )))
x_levels <- levels(df$x_fct)
n_levels <- length(x_levels)
names(x_levels) <- as.character(0:(n_levels - 1))
df <- df %>%
  mutate(y = as.numeric(fct_recode(x_fct, !!!x_levels)) - 1)
df
# A tibble: 8 x 3
      x x_fct          y
  <dbl> <fct>      <dbl>
1  1000 level_1000     0
2  1000 level_1000     0
3    NA level_na       4
4  1001 level_1001     1
5  1002 level_1002     2
6  1003 level_1003     3
7    NA level_na       4
8  1003 level_1003     3

But this seems very cumbersome. Surely there's a simpler way, preferably in a single pipe.

Giora Simchoni
  • 3,487
  • 3
  • 34
  • 72
  • 4
    `as.numeric(factor(x, exclude = NULL)) - 1`. Somewhat related: [How to create a consecutive group number](https://stackoverflow.com/questions/6112803/how-to-create-a-consecutive-group-number) – Henrik Feb 15 '21 at 12:46
  • 1
    A polite way to say duplicate, thanks @Henrik. – Giora Simchoni Feb 15 '21 at 12:57
  • But they don't explicitly consider `NA` - the devil is in the details. – Henrik Feb 15 '21 at 13:00

1 Answers1

2

One way would be to use match + unique. You can add sample to add randomness

library(dplyr)

df %>%
  mutate(level = paste('level', x, sep = '_'), 
         y = match(x, sample(unique(x))) - 1)

#      x level          y
#  <dbl> <chr>      <dbl>
#1  1000 level_1000     4
#2  1000 level_1000     4
#3    NA level_NA       2
#4  1001 level_1001     0
#5  1002 level_1002     1
#6  1003 level_1003     3
#7    NA level_NA       2
#8  1003 level_1003     3
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Beautiful, thanks. I'd change it to `df %>% mutate(y = match(x, unique(x)) - 1)` since the `level` variable in my code is just a helper and no randomness is required. – Giora Simchoni Feb 15 '21 at 12:51
  • Comparing with `microbenchmark` on `x` the length of 100K, @Ronak's answer is about x10 faster on my machine: `df <- tibble(x = sample.int(1000:10000, size = 100000, replace = TRUE)); microbenchmark::microbenchmark( match(df$x, unique(df$x)) - 1, as.numeric(factor(df$x, exclude = NULL)) - 1 )` – Giora Simchoni Feb 15 '21 at 13:04