Writing code/function that matches column names by highest similarity

Question

I have five datasets, which cover the same topics over time.

library(data.table)
DT <- data.table(A= round(rnorm(10,10,10),2),
                 B= round(rnorm(10,10,10),2),
                 C= round(rnorm(10,10,10),2))
DT_2 <- data.table(A= round(rnorm(10,10,10),2),
                 B= round(rnorm(10,10,10),2),
                 C= round(rnorm(10,10,10),2),
                 D= round(rnorm(10,10,10),2))
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
names(DT)   <- c("something","nothing", "anything")
names(DT_2) <- c("some thing","no thing", "any thing", "number4")
names(DT_3) <- c("some thing wrong","nothing", "anything_")
names(DT_4) <- c("something","nothingg", "anything", "number_4")
names(DT_5) <- c("something","nothing", "anything happening", "number4")

However, each year, they are a little different. The names of the columns have slightly changed, some columns have been added and some have been removed. I would like to "rowbind" these datasets. However, every dataset has about 100 columns, and manually making all the column names consistent would be hell.

EDIT: Please note that the columns do not necessarily have the same index, such as is the case for example in the edited column names below, where DT_2 has the column XXX.

# EDIT
names(DT)<- c("something","nothing", "anything", "number4")
names(DT_2)<- c("some thing","no thing","XXX", "number4")
names(DT_3)<- c("some thing wrong","nothing", "anything_")
names(DT_4)<- c("something","nothingg", "anything", "number_4")
names(DT_5)<- c("something","nothing", "anything happening", "number4")

I thought it might be a better idea to write a function to do it for me.

I once asked some help with a function that did something similar here. The following function coalesces columns with capitalised and non capitalised versions of variables names without specifying the variable names.

Very neatly, it additionally specifies which var names were coalesced.

library(data.table)
library(magrittr) # piping is used to improve readability
names(DT_panel) %>% 
  data.table(orig = ., lc = tolower(.)) %>% 
  .[, {
    if (.N > 1L) {
      new <- toupper(.BY)
      old <- setdiff(orig, new)
      DT_panel[, (new) := fcoalesce(.SD), .SDcols = orig]
      DT_panel[, (old) := NULL]
      sprintf("Coalesced %s onto %s", toString(old), new)
    }
  }, by = lc]

In addition, I found this question here, which does a fuzzy join based on column entries.

library(fuzzyjoin); library(dplyr);

stringdist_join(a, b, 
                by = "name",
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(name.x) %>%
  top_n(1, -dist)

The problem is that I don't under stand either of these solution well enough to combine them in to a function providing the solution that I want.

Could anyone help me make a start? My desired output is as follows:

DT <- data.table(A= round(rnorm(10,10,10),2),
                 B= round(rnorm(10,10,10),2),
                 C= round(rnorm(10,10,10),2))
DT_2 <- data.table(A= round(rnorm(10,10,10),2),
                 B= round(rnorm(10,10,10),2),
                 C= round(rnorm(10,10,10),2),
                 D= round(rnorm(10,10,10),2))
D <- c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA)
DT_3 <- DT
DT_4 <- DT_2
DT_5 <- DT_2
DT <- cbind(DT, D)
DT_3 <- cbind(DT_3, D)
DT <- rbind (DT, DT_2, DT_3, DT_4, DT_5)
names(DT) <- c("something","nothing", "anything", "number4")

The similarities in your real data are such as the ones in the example? Because if it is, you could detect unique patterns such as "some", "any", "4" and "not". And it may be useful first, substracting all spaces, dashes and not alphabetical/numeric characters. — Santiago I. Hurtado, Oct 22 '20 at 13:13
Thank you for your comment Santiago. The similarities in my real data are slightly more complicated than the example. They might consist of three words, where two words are the same, but they are still different. For example: `Apples Red Price` , `Apples Red Quantity`. Where in the next dataset the latter is for example called `Apples Red Amount`. — Tom, Oct 23 '20 at 05:52

Paul · Accepted Answer · 2020-10-22T16:02:54.453

This method based on fuzzyjoin::stringdist_join. It handles both new and removed columns.

Start with some dummy data.

library(tidyverse)

df1 <- tibble("something" = 1,"nothing" = 2, "anything" = 3, "number4" = 4)
df2 <- tibble("some thing" = 1,"no thing" = 2,"XXX" = 99, "number4" = 4)
df3 <- tibble("some thing wrong" = 1,"nothing" = 2, "anything_" = 4)
df4 <- tibble("something" = 1,"nothingg" = 2, "anything" = 2, "number_4" = 4, "YYY" = 100)
df5 <- tibble("something" = 1,"nothing" = 2, "anything happening" = 2, "number4" = 4)

fuzzy_rowbind fuzzy combines two data frames. It uses fuzzyjoin::stringdist_join to identify which columns are most similar. The second data frame's columns are renamed and they are combined.

fuzzy_rowbind <- function(a, b, method = "cosine", max_dist = 0.9999) {
  a_name_df <- tibble(name = names(a))
  b_name_df <- tibble(name = names(b))
  
  fj <- 
    fuzzyjoin::stringdist_join(
      a_name_df,
      b_name_df, 
      by = "name",
      mode = "left",
      ignore_case = FALSE, 
      method = method, 
      max_dist = max_dist, 
      distance_col = "dist"
    ) %>%
    arrange(dist)
  
  name_mapping <- NULL
  while (nrow(fj) > 0 && !all(b_name_df$name %in% name_mapping$name.y)) {
    name_mapping <- bind_rows(name_mapping, fj %>% slice(1))
    
    fj <- fj %>% filter(!name.x %in% name_mapping$name.x, !name.y %in% name_mapping$name.y)
  }
  
  new_names <- setNames(name_mapping$name.y, name_mapping$name.x)
  
  b_renamed <- rename(b, new_names[!is.na(new_names)])
  
  enframe(new_names, name = "new_name", value = "original_name") %>%
    filter(new_name != original_name, !is.na(new_name)) %>%
    as.data.frame() %>%
    print()
  cat("\n")
  
  bind_rows(a, b_renamed)
}

For example, here is what happens when we combine df1 and df2.

fuzzy_rowbind(df1, df2)
#>    new_name original_name
#> 1 something    some thing
#> 2   nothing      no thing
#> 
#> # A tibble: 2 x 5
#>   something nothing anything number4   XXX
#>       <dbl>   <dbl>    <dbl>   <dbl> <dbl>
#> 1         1       2        3       4    NA
#> 2         1       2       NA       4    99

Next, define fuzzy_rowbind_all which can take a list of dataframes and combine them all together.

fuzzy_rowbind_all <- function(l) {
  last(accumulate(l, fuzzy_rowbind))
}

Here is fuzzy_rowbind_all used on our data frames.

fuzzy_rowbind_all(
  lst(df1, df2, df3, df4, df5)
)
#>    new_name original_name
#> 1 something    some thing
#> 2   nothing      no thing
#> 
#>    new_name    original_name
#> 1  anything        anything_
#> 2 something some thing wrong
#> 
#>   new_name original_name
#> 1  nothing      nothingg
#> 2  number4      number_4
#> 
#>   new_name      original_name
#> 1 anything anything happening
#> 
#> # A tibble: 5 x 6
#>   something nothing anything number4   XXX   YYY
#>       <dbl>   <dbl>    <dbl>   <dbl> <dbl> <dbl>
#> 1         1       2        3       4    NA    NA
#> 2         1       2       NA       4    99    NA
#> 3         1       2        4      NA    NA    NA
#> 4         1       2        2       4    NA   100
#> 5         1       2        2       4    NA    NA

I was half way testing your solution when I noticed you switched your approach haha. Testing the new one now. Thank you very much:) — Tom, Oct 22 '20 at 15:12
I am currently getting the following error: `Error: Can't combine ..1$variable > and ..2$variable . Run rlang::last_error() to see where the error occurred.` Even though I applied a cleaning function beforehand. So my best guess would be that it matched two columns that should not be a match, since they are really of a different type. Is it possible to make the solution "unmatch" these occurences ? — Tom, Oct 22 '20 at 15:20
Yeah, that can be done. Although - can you try it again? I updated it so that the matching is more accurate and possibly it might not try to combine those two columns now. — Paul, Oct 22 '20 at 15:28
I manually fished out the culprit. It was a factor variable which had a completely different name in two datasets. I corrected that name, after which is worked (probably because the right column was now the closest match). — Tom, Oct 22 '20 at 15:37
I am already super happy and grateful for your answer. I have one last question though. I need to be absolutely certain that the right columns are matched up. Is it possible to print out which columns are matched so that I can manually go through the list to check for mistakes? Something like `sprintf("Coalesced %s onto %s", toString(old), new)` part of the function I posted? — Tom, Oct 22 '20 at 15:40
This is absolutely amazing. I have already accepted your answer and I will award you the bounty tomorrow (I have to another wait 17 hours). You helped me out a lot and I am really grateful! — Tom, Oct 22 '20 at 16:12

Writing code/function that matches column names by highest similarity

1 Answers1