Hi, I have been trying for a while to match two large columns of names, several of which have different spellings. So far I have written some code to practice on a smaller dataset:

examples %>% mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1, TRUE ~ example_2))

This creates a new column containing the name from example_1 when it is within an edit distance of 3 of the name in example_2. However, it does not give the name from example_2 when that criterion is not met, which I need it to do.

This code also only compares the names that sit in the same row of each column, whereas I need it to work on a dataset whose two columns have different lengths (one is larger, so they can't be put in the same order).

It also needs to skip the NAs in the smaller column of names (they are only there to pad it to the same length as the other one).

Anyone know how to do something like this?

dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L, 
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight", 
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA, 
5L), class = "data.frame")
Beth
  • It will be extremely challenging to answer your question without at least a sample of your data. Please [edit] your question with the output of `dput(examples)` or `dput(head(examples))` if your data is very large. See [How to make a great R reproducible example](https://stackoverflow.com/a/5963610/) for more. – Ian Campbell Apr 01 '21 at 16:16
  • Thought I could maybe do this using a for loop? – Beth Apr 04 '21 at 11:07

1 Answer

The problem is that your columns are factors rather than character vectors. Combining two factor columns that have different levels into one column can give unexpected results.
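As a small illustration of the kind of surprise factors can cause, here is a sketch using base `ifelse()`. This is an assumption chosen for illustration and may not be exactly what happened inside `case_when()` in your dplyr version. The vectors `f1` and `f2` are made up from two of your names.

```r
# Hypothetical two-element factors built from names in the question
f1 <- factor(c("sheilaovensnew", "terrifrank"))
f2 <- factor(c("sheilowansknew", "terryfrenk"))

# ifelse() drops the factor attributes, so this yields the underlying
# integer level codes instead of the names
mixed <- ifelse(c(TRUE, FALSE), f1, f2)
print(mixed)

# Converting to character first keeps the actual names
fixed <- ifelse(c(TRUE, FALSE), as.character(f1), as.character(f2))
print(fixed)
```
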

First convert your columns to character:

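library(dplyr)

examples %>%
  # convert the factor columns to character before comparing them
  mutate(across(contains("example"), as.character)) %>%
  # row-wise edit distance; keep example_1 when it is within 3 edits
  mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1,
                            TRUE ~ example_2))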
#           example_1         example_2             new_ID
#1     sheilaovensnew    sheilowansknew     sheilowansknew
#2     sandramaymeres      candramymars       candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4    grarryfieldsred   grarryfieldsred    grarryfieldsred
#5         terrifrank        terryfrenk         terrifrank

In your dput output, the name of the example_1 column was somehow changed (to "."), so I ran this first:

names(examples)[1] <- "example_1"
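For the case where the two columns are not in the same order, one possible approach is to compute the full distance matrix with `adist()` and pick the nearest candidate for each name. The following is only a sketch: `closest_match()` and its arguments are hypothetical names, not part of any package, and it assumes that a name with no candidate within `max_dist` edits should simply keep its own spelling. NA padding is ignored on both sides.

```r
# Hypothetical helper: for each name in x, return the closest name in
# candidates if it is within max_dist edits, otherwise keep x unchanged.
closest_match <- function(x, candidates, max_dist = 3) {
  x <- as.character(x)
  candidates <- as.character(candidates)
  candidates <- candidates[!is.na(candidates)]   # drop NA padding
  out <- x
  keep <- !is.na(x)                              # NAs in x stay NA
  if (!any(keep) || length(candidates) == 0) return(out)

  d <- adist(x[keep], candidates)                # rows: x, columns: candidates
  best <- max.col(-d, ties.method = "first")     # column of the row minimum
  best_dist <- d[cbind(seq_len(nrow(d)), best)]
  hit <- best_dist <= max_dist

  out[which(keep)[hit]] <- candidates[best[hit]]
  out
}

# Usage with names from the question, deliberately shuffled and padded
example_1 <- c("sheilaovensnew", "sandramaymeres", "terrifrank")
example_2 <- c("terryfrenk", NA, "candramymars", "sheilowansknew")
print(closest_match(example_1, example_2))
```

Because `adist()` builds a matrix of every pair, memory grows with the product of the two column lengths, so very large columns may need to be processed in chunks.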
Ian Campbell
  • OK great, this seems to be working now: it prints the correct example depending on edit distance. Is there a way to get the code to work when the columns are not in the same order? E.g. for each row in example_1, it runs through the entirety of example_2 and prints the closest name from example_2, provided it is an edit distance of 3 or less away? – Beth Apr 02 '21 at 11:16