Find matching string patterns

Question

Suppose I have the following data:

example = tibble::tibble(
  id = 1:10,
  vac = c("FFizer", "sinovasm", "aztraseneca", "phiser", "sonovac",
          "faizer", "sinivasc", "astraseneca", "sinocav", "aztraxeneca")
)

Which looks like this:

# A tibble: 10 x 2
      id vac        
   <int> <chr>      
 1     1 FFizer     
 2     2 sinovasm   
 3     3 aztraseneca
 4     4 phiser     
 5     5 sonovac    
 6     6 faizer     
 7     7 sinivasc   
 8     8 astraseneca
 9     9 sinocav    
10    10 aztraxeneca

I want to find whether each value of the variable vac matches, to some degree, any option from a vector.

Say the vector to use as identifier is:

labs = c("sinovac", "pfizer", "astrazeneca")

Crossing example data.frame with the vector labs should give some output like this:

correction = tibble::tibble(
  id = 1:10,
  vac = c("FFizer", "sinovasm", "aztraseneca", "phiser", "sonovac",
          "faizer", "sinivasc", "astraseneca", "sinocav", "aztraxeneca"),
  match = c("pfizer", "sinovac", "astrazeneca", "pfizer", "sinovac",
            "pfizer", "sinovac", "astrazeneca", "sinovac", "astrazeneca")
)

Looking like this:

# A tibble: 10 x 3
      id vac         match      
   <int> <chr>       <chr>      
 1     1 FFizer      pfizer     
 2     2 sinovasm    sinovac    
 3     3 aztraseneca astrazeneca
 4     4 phiser      pfizer     
 5     5 sonovac     sinovac    
 6     6 faizer      pfizer     
 7     7 sinivasc    sinovac    
 8     8 astraseneca astrazeneca
 9     9 sinocav     sinovac    
10    10 aztraxeneca astrazeneca

The main idea is to find a way of getting a homogeneous vac variable.

In addition to this, I'd like to create a variable that indicates the "matching degree". That is, if the string is "FFizer", then its match would be "pfizer" and their matching degree would be around 0.66.

I think you would need to specify a little more about your algorithm. How do you arrive at 0.66? Are you checking how many characters match? What if they are scrambled or different lengths? — Carey Caginalp, Apr 14 '21 at 20:45
I didn't give any further detail about the "matching degree" because I do not have any clear idea how the matching would be. Actually, I said 0.66 just counting how many characters were common between the two strings. But the main problem is first how to match vac with the lab vector. — Cristhian, Apr 14 '21 at 20:58
@Cristhian you want to use Levenshtein distance. See [this question](https://stackoverflow.com/questions/3182091/fast-levenshtein-distance-in-r) for options + research the algorithm; fits your use-case perfectly in my opinion — ctwheels, Apr 14 '21 at 21:22
@ctwheels, thank you so much. Reading that post helped me find the stringdist package! — Cristhian, Apr 14 '21 at 22:42

LMc · Answer 1 · 2021-04-14T22:09:53.647

You should take a look at the packages fuzzyjoin and stringdist. Using Levenshtein distance:

library(fuzzyjoin)
library(dplyr)
library(stringdist)

# df holds the data from the question as a plain data.frame
df <- as.data.frame(example)
labs <- data.frame(vac = c("sinovac", "pfizer", "astrazeneca"))

# Join on Levenshtein distance; pairs within the default max_dist = 2 are kept
fuzzyjoin::stringdist_left_join(df, labs,
                                by = "vac",
                                method = "lv") %>%
  dplyr::rename(vac = vac.x, match = vac.y)

Output

   id         vac       match
1   1      FFizer      pfizer
2   2    sinovasm     sinovac
3   3 aztraseneca astrazeneca
4   4      phiser      pfizer
5   5     sonovac     sinovac
6   6      faizer      pfizer
7   7    sinivasc     sinovac
8   8 astraseneca astrazeneca
9   9     sinocav     sinovac
10 10 aztraxeneca astrazeneca

The method argument of stringdist_left_join accepts any of the distance methods provided by the stringdist package.
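To get a feel for how the choice of method changes the result, the distances can be computed directly with stringdist::stringdist. This is only a small sketch on one pair from the question; the variable names a, b, lv, osa and jw are made up for illustration.

```r
library(stringdist)

a <- "faizer"
b <- "pfizer"

# Edit-based distances count character operations
lv  <- stringdist(a, b, method = "lv")   # Levenshtein
osa <- stringdist(a, b, method = "osa")  # optimal string alignment (stringdist's default)

# Jaro distance, scaled to [0, 1]; pass p > 0 for the Winkler prefix adjustment
jw <- stringdist(a, b, method = "jw")

c(lv = lv, osa = osa, jw = jw)
```

Edit-based methods return whole-number counts, while "jw" returns a fraction, so a sensible max_dist depends on the method chosen.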

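Additionally, stringdist_left_join has the argument distance_col, which names a column that holds the computed distance. For example, using Jaro-Winkler distance (a smaller value is "closer"). Because this distance never exceeds 1 and the default max_dist is 2, every pair is returned, so you would filter on dist to keep the closest match per id: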

fuzzyjoin::stringdist_left_join(df, labs,
                                 by = c("vac"),
                                 method = "jw",
                                 distance_col = "dist") %>% 
   dplyr::rename(vac = vac.x, match = vac.y)

   id         vac       match       dist
1   1      FFizer     sinovac 0.56349206
2   1      FFizer      pfizer 0.22222222
3   1      FFizer astrazeneca 0.57575758
4   2    sinovasm     sinovac 0.13095238
5   2    sinovasm      pfizer 0.56944444
6   2    sinovasm astrazeneca 0.52272727
7   3 aztraseneca     sinovac 0.51082251
8   3 aztraseneca      pfizer 0.52020202
9   3 aztraseneca astrazeneca 0.03030303
10  4      phiser     sinovac 0.56349206
11  4      phiser      pfizer 0.22222222
12  4      phiser astrazeneca 0.52020202
13  5     sonovac     sinovac 0.15079365
14  5     sonovac      pfizer 1.00000000
15  5     sonovac astrazeneca 0.43290043
16  6      faizer     sinovac 0.56349206
17  6      faizer      pfizer 0.11111111
18  6      faizer astrazeneca 0.44823232
19  7    sinivasc     sinovac 0.13095238
20  7    sinivasc      pfizer 0.56944444
21  7    sinivasc astrazeneca 0.45075758
22  8 astraseneca     sinovac 0.43290043
23  8 astraseneca      pfizer 0.66161616
24  8 astraseneca astrazeneca 0.06060606
25  9     sinocav     sinovac 0.04761905
26  9     sinocav      pfizer 0.56349206
27  9     sinocav astrazeneca 0.51082251
28 10 aztraxeneca     sinovac 0.51082251
29 10 aztraxeneca      pfizer 0.52020202
30 10 aztraxeneca astrazeneca 0.12727273
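To reduce the long output above to one row per id, one option is to keep the candidate with the smallest distance in each group. The following is a sketch, assuming dplyr 1.0.0 or later for slice_min; the name closest is just for illustration, and ties are broken arbitrarily by with_ties = FALSE.

```r
library(dplyr)
library(fuzzyjoin)

example <- tibble::tibble(
  id = 1:10,
  vac = c("FFizer", "sinovasm", "aztraseneca", "phiser", "sonovac",
          "faizer", "sinivasc", "astraseneca", "sinocav", "aztraxeneca")
)
labs <- data.frame(vac = c("sinovac", "pfizer", "astrazeneca"))

closest <- stringdist_left_join(example, labs,
                                by = "vac",
                                method = "jw",
                                distance_col = "dist") %>%
  rename(vac = vac.x, match = vac.y) %>%
  group_by(id) %>%
  # keep the single candidate with the smallest distance for each id
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

closest
```

Alternatively, lowering max_dist (for example to a value well below 1 when using "jw") drops poor candidates before the join returns them, at the risk of leaving some rows unmatched.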

score 0 · Answer 2 · answered Apr 14 '21 at 22:40

Following some of the advice here, I read through the stringdist package and found two useful functions: amatch, which finds the closest match in a vector within a maximum allowed distance, and stringdist, which measures how different the match is (0 means the strings are identical, and higher values mean the string is less similar to the target).

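library(dplyr)
library(stringdist)

# labs is the character vector from the question
correction <- example %>%
  mutate(
    # index into labs of the closest entry within distance 2 (NA if none)
    m = amatch(x = vac, table = labs, maxDist = 2),
    # look up the matched name by its index
    matched_to = labs[m],
    # distance between the original and the matched string (0 = identical)
    similarity = stringdist(vac, matched_to)
  )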

Having this output:

# A tibble: 10 x 5
      id vac             m matched_to  similarity
   <int> <chr>       <int> <chr>            <dbl>
 1     1 FFizer          2 pfizer               2
 2     2 sinovasm        1 sinovac              2
 3     3 aztraseneca     3 astrazeneca          2
 4     4 phiser          2 pfizer               2
 5     5 sonovac         1 sinovac              1
 6     6 faizer          2 pfizer               2
 7     7 sinivasc        1 sinovac              2
 8     8 astraseneca     3 astrazeneca          1
 9     9 sinocav         1 sinovac              2
10    10 aztraxeneca     3 astrazeneca          2
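The similarity column above is really a distance (lower is closer). If a "matching degree" between 0 and 1 is wanted, as in the question, stringdist::stringsim may be a closer fit: it rescales the distance so that 1 means identical. This is a sketch using Levenshtein distance; the names m and degree are illustrative only.

```r
library(stringdist)

vac <- c("FFizer", "sinovasm", "aztraseneca", "phiser", "sonovac",
         "faizer", "sinivasc", "astraseneca", "sinocav", "aztraxeneca")
labs <- c("sinovac", "pfizer", "astrazeneca")

# index of the closest lab for each entry (NA if none is within maxDist)
m <- amatch(vac, labs, method = "lv", maxDist = 2)

# similarity in [0, 1]: 1 - distance / length of the longer string
degree <- stringsim(vac, labs[m], method = "lv")

data.frame(vac, match = labs[m], degree = round(degree, 2))
```

For "FFizer" against "pfizer" this gives 1 - 2/6, roughly 0.67, which is close to the 0.66 mentioned in the question.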
