18

What is the most efficient way to update/replace NAs in main dataset with (correct) values in a lookup table? This is such a common operation! Similar questions do not seem to have tidy solutions.

Constraints: 1) Please assume a large number of missing values and bigger lookup table than the example given. So case-wise replacement operations would be impractical (no case_when, if_else, etc.)

2)The lookup table does not have all values of main dataframe, only the replacement ones.

Tidyverse solution answer much preferred. Similar questions do not seem to have tidy solutions.

library(tidyverse)

### Main Dataframe ###
df1 <- tibble(
  state_abbrev = state.abb[1:10],
  state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
  value = sample(500:1200, 10, replace=TRUE)
)


#> # A tibble: 10 x 3
#>    state_abbrev state_name value
#>    <chr>        <chr>      <int>
#>  1 AL           Alabama      551
#>  2 AK           Alaska       765
#>  3 AZ           Arizona      508
#>  4 AR           Arkansas     756
#>  5 CA           California   741
#>  6 CO           <NA>        1100
#>  7 CT           <NA>         719
#>  8 DE           <NA>         874
#>  9 FL           Florida      749
#> 10 GA           Georgia      580


### Lookup Dataframe ###
lookup_df <- tibble(
  state_abbrev = state.abb[6:8],
  state_name = state.name[6:8]
)

#> # A tibble: 3 x 2
#>   state_abbrev state_name 
#>   <chr>        <chr>      
#> 1 CO           Colorado   
#> 2 CT           Connecticut
#> 3 DE           Delaware

Ideally, a left_join would have a replacement option for missing values. Alas...

left_join(df1, lookup_df)
#> Joining, by = c("state_abbrev", "state_name")
#> # A tibble: 10 x 3
#>    state_abbrev state_name value
#>    <chr>        <chr>      <int>
#>  1 AL           Alabama      551
#>  2 AK           Alaska       765
#>  3 AZ           Arizona      508
#>  4 AR           Arkansas     756
#>  5 CA           California   741
#>  6 CO           <NA>        1100
#>  7 CT           <NA>         719
#>  8 DE           <NA>         874
#>  9 FL           Florida      749
#> 10 GA           Georgia      580

```

Created on 2018-07-28 by the reprex package (v0.2.0).

Nettle
  • 3,193
  • 2
  • 22
  • 26
  • You could just create a full look-up table and do a merge (left join)? – Roman Luštrik Jul 28 '18 at 14:59
  • 1
    If the lookup table is more than two columns, it's an update join, which doesn't exist in dplyr ([for now](https://github.com/tidyverse/tidyr/issues/183)). The basic functionality is built defined in `coalesce`, though, which can be applied programmatically if you don't want to specify lots of column names. – alistaire Jul 28 '18 at 15:21
  • @Ronak: This question marked as duplicate, though link does not go to tidy solution. – Nettle Jul 28 '18 at 17:21
  • @alistaire: your suggestion, I think? `left_join(df1, lookup_df, by = "state_abbrev") %>% mutate(merged.col = coalesce(df.merged$state_name.x, state_name.y) ) %>% select(state_abbrev, state_name= merged.col, value) ` – Nettle Jul 28 '18 at 19:46
  • More or less, yes. I wrote a blog post about it: https://alistaire.rbind.io/blog/coalescing-joins/ – alistaire Jul 28 '18 at 21:13
  • 1
    @alistaire: get this into dplyr at once! – Nettle Jul 28 '18 at 22:15
  • @Nettle did you check [this](https://stackoverflow.com/a/35636674/3962914) ? – Ronak Shah Jul 29 '18 at 03:08

5 Answers5

17

Picking up Alistaire's and Nettle's suggestions and transforming into a working solution

df1 %>% 
  left_join(lookup_df, by = "state_abbrev") %>% 
  mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
  select(-state_name.x, -state_name.y)
# A tibble: 10 x 3
   state_abbrev value state_name 
   <chr>        <int> <chr>      
 1 AL             671 Alabama    
 2 AK             501 Alaska     
 3 AZ            1030 Arizona    
 4 AR             694 Arkansas   
 5 CA             881 California 
 6 CO             821 Colorado   
 7 CT             742 Connecticut
 8 DE             665 Delaware   
 9 FL             948 Florida    
10 GA             790 Georgia

The OP has stated to prefer a "tidyverse" solution. However, update joins are already available with the data.table package:

library(data.table)
setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
df1
    state_abbrev  state_name value
 1:           AL     Alabama  1103
 2:           AK      Alaska  1036
 3:           AZ     Arizona   811
 4:           AR    Arkansas   604
 5:           CA  California   868
 6:           CO    Colorado  1129
 7:           CT Connecticut   819
 8:           DE    Delaware  1194
 9:           FL     Florida   888
10:           GA     Georgia   501

Benchmark

library(bench)
bm <- press(
  na_share = c(0.1, 0.5, 0.9),
  n_row = length(state.abb) * 2 * c(1, 100, 10000),
  {
    n_na <- na_share * length(state.abb)
    set.seed(1)
    na_idx <- sample(length(state.abb), n_na)
    tmp <- data.table(state_abbrev = state.abb, state_name = state.name)
    lookup_df <-tmp[na_idx] 
    tmp[na_idx, state_name := NA]
    df0 <- as_tibble(tmp[sample(length(state.abb), n_row, TRUE)])
    mark(
      dplyr = {
        df1 <- copy(df0)
        df1 <- df1 %>% 
          left_join(lookup_df, by = "state_abbrev") %>% 
          mutate(state_name = coalesce(state_name.x, state_name.y)) %>% 
          select(-state_name.x, -state_name.y)
        df1
      },
      upd_join = {
        df1 <- copy(df0)
        setDT(df1)[setDT(lookup_df), on = "state_abbrev", state_name := i.state_name]
        df1
      }
    )
  }
)
ggplot2::autoplot(bm)

enter image description here

data.table's upate join is always faster (note the log time scale).

As the update join modifies the data object, a fresh copy is used for each benchmark run.

Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
17

Here is a single line solution with rows_update():

df1 %>% 
  rows_update(lookup_df, by = "state_abbrev")

Demo:

library(dplyr)

### Main Dataframe ###
df1 <- tibble(
  state_abbrev = state.abb[1:10],
  state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
  value = sample(500:1200, 10, replace=TRUE)
)

### Lookup Dataframe ###
lookup_df <- tibble(
  state_abbrev = state.abb[6:8],
  state_name = state.name[6:8]
)

df1 %>% 
  rows_update(lookup_df, by = "state_abbrev")
#> # A tibble: 10 x 3
#>    state_abbrev state_name  value
#>    <chr>        <chr>       <int>
#>  1 AL           Alabama       532
#>  2 AK           Alaska        640
#>  3 AZ           Arizona       521
#>  4 AR           Arkansas      523
#>  5 CA           California    970
#>  6 CO           Colorado      695
#>  7 CT           Connecticut   504
#>  8 DE           Delaware     1088
#>  9 FL           Florida       979
#> 10 GA           Georgia      1059
HBat
  • 4,873
  • 4
  • 39
  • 56
  • 1
    Love this! Worth noting that `rows_update()` doesn't work (yet, I hope!) if the key is duplicated (like multiple rows with state_abbrev "CO"). Not a problem with this data, but I've got to find another solution for my problem. – Oliver May 11 '21 at 14:39
4

There's currently no one-shot for trying to coalesce more than one column (which can be done by using a lookup table approach within ifelse(is.na(value), ..., value)), though there has been discussion of how such behavior may be implemented. For now, you can build it manually. If you've got a lot of columns, you can coalesce programmatically, or even put it in a function.

library(tidyverse)

df1 <- tibble(
    state_abbrev = state.abb[1:10],
    state_name = c(state.name[1:5], rep(NA, 3), state.name[9:10]),
    value = sample(500:1200, 10, replace=TRUE)
)

lookup_df <- tibble(
    state_abbrev = state.abb[6:8],
    state_name = state.name[6:8]
)

df1 %>% 
    full_join(lookup_df, by = 'state_abbrev') %>% 
    bind_cols(map_dfc(grep('.x', names(.), value = TRUE), function(x){
        set_names(
            list(coalesce(.[[x]], .[[gsub('.x', '.y', x)]])), 
            gsub('.x', '', x)
        )
    })) %>% 
    select(union(names(df1), names(lookup_df)))
#> # A tibble: 10 x 3
#>    state_abbrev state_name  value
#>    <chr>        <chr>       <int>
#>  1 AL           Alabama       877
#>  2 AK           Alaska       1048
#>  3 AZ           Arizona       973
#>  4 AR           Arkansas      860
#>  5 CA           California    938
#>  6 CO           Colorado      639
#>  7 CT           Connecticut   547
#>  8 DE           Delaware      672
#>  9 FL           Florida       667
#> 10 GA           Georgia      1142
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • Does the `coalesce` approach only work with `NA` and not other kinds of missingness like empty strings? – James Jul 31 '18 at 15:35
  • Correct. An empty string is data in R, like an integer value of 0. Convert them to NA with `na_if`, if you need. – alistaire Jul 31 '18 at 22:18
1

in order to preserve the column order:

df1 %>% 
  left_join(lookup_df, by = "state_abbrev") %>% 
  mutate(state_name.x = coalesce(state_name.x, state_name.y)) %>% 
  rename(state_name = state_name.x) %>%
  select(-state_name.y)
gglee
  • 11
  • 2
-1

If the abbreviation column is complete and the lookup table is complete, could you just drop the state_name column and then join?

left_join(df1 %>% select(-state_name), lookup_df, by = 'state_abbrev') %>% 
  select(state_abbrev, state_name, value)

Another option could be to use match and if_else in a mutate call using the built in state name and abbreviation lists:

df1 %>% 
  mutate(state_name = if_else(is.na(state_name), state.name[match(state_abbrev,state.abb)], state_name))

Both give the same output:

# A tibble: 10 x 3
   state_abbrev state_name  value
   <chr>        <chr>       <int>
 1 AL           Alabama       525
 2 AK           Alaska        719
 3 AZ           Arizona      1186
 4 AR           Arkansas     1051
 5 CA           California    888
 6 CO           Colorado      615
 7 CT           Connecticut   578
 8 DE           Delaware      894
 9 FL           Florida       536
10 GA           Georgia       599       
sbha
  • 9,802
  • 2
  • 74
  • 62