I have two columns with ~20k rows of names (not all unique) that I want to compare row by row. I would also like to compare the name lengths and get the % difference in length relative to the LV (Levenshtein) distance, so I can start grouping names by how closely each row matches.

Example of subset data:

df <- data.frame(R_Number = c(1:10), A = c('Microsoft', 'Microsoft Corporation', 'Microsoft Corp', 'Microsoft inc', 'Microsoft', 'Microsoft INC', 'Microsoft CORP', 'MSFt', 'Microsoft inc', 'Microsoft'), B = c('Microsoft', 'MSFT', 'MSFT Corp', 'Apple inc', 'Microsoft', 'Microsoft INC', 'Microsoft corp', 'Microsoft', 'AMZN', 'Amazon'))

Example of using the stringdist function to calculate the difference between the two columns, row by row:

test_2 <- sapply(dist.methods, function(lv) stringdist(df$A, df$B, method=lv))

I get an output table, but I am having trouble visualizing it and getting a new field/table that shows the LV distance per row alongside its corresponding names.

Desired output:

A    | B         | LV_DIST
MSFT | Microsoft | 8
Dinho

2 Answers

You might not need *apply here (though I might be interpreting your desired output incorrectly).

df$distance <- stringdist(df$A, df$B, method = "lv")

Output:

 R_Number                     A              B distance
        1             Microsoft      Microsoft        0
        2 Microsoft Corporation           MSFT       20
        3        Microsoft Corp      MSFT Corp        8
        4         Microsoft inc      Apple inc        9
        5             Microsoft      Microsoft        0
        6         Microsoft INC  Microsoft INC        0
        7        Microsoft CORP Microsoft corp        4
        8                  MSFt      Microsoft        7
        9         Microsoft inc           AMZN       13
       10             Microsoft         Amazon        8
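
For the length comparison mentioned in the question, here is a minimal sketch. It assumes "% difference" means the LV distance divided by the length of the longer name; that definition and the column names `len_A`, `len_B`, and `pct_diff` are my own choices for illustration, not something the question specifies.

```r
library(stringdist)

# Small subset of the question's data
df <- data.frame(
  A = c("Microsoft", "Microsoft Corp", "MSFt"),
  B = c("Microsoft", "MSFT Corp", "Microsoft"),
  stringsAsFactors = FALSE
)

# Row-wise Levenshtein distance (stringdist is vectorized over a and b)
df$distance <- stringdist(df$A, df$B, method = "lv")

# Name lengths per row
df$len_A <- nchar(df$A)
df$len_B <- nchar(df$B)

# Assumption: percent difference = distance relative to the longer name.
# 0 means identical, 100 means nothing lines up.
# Note: this is NaN if both names are empty strings.
df$pct_diff <- 100 * df$distance / pmax(df$len_A, df$len_B)

print(df)
```

Dividing by the longer length keeps the value between 0 and 100, which makes rows with short and long names comparable.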
salexir

Because stringdist is vectorized over both inputs, it also works directly inside dplyr's mutate:

library(tidyverse)
library(stringdist)

test_2 <- df %>%
    mutate(distance = stringdist(A, B, method='lv'))
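
To get from distances to the grouping the question asks about, one option is to bucket rows by a normalized similarity. This is only a sketch: `stringsim` is the stringdist package's similarity function (for "lv" it is 1 minus the distance divided by the longer string's length), while the 0.7 cutoff, the group labels, and the `match_group` column name are arbitrary choices for illustration that you would tune on your own data.

```r
library(dplyr)
library(stringdist)

# Small subset of the question's data
df <- data.frame(
  A = c("Microsoft", "Microsoft CORP", "Microsoft"),
  B = c("Microsoft", "Microsoft corp", "Amazon"),
  stringsAsFactors = FALSE
)

test_2 <- df %>%
  mutate(
    distance   = stringdist(A, B, method = "lv"),
    # Similarity in [0, 1]; 1 means the two names are identical
    similarity = stringsim(A, B, method = "lv"),
    # Hypothetical buckets; the 0.7 threshold is an assumption
    match_group = case_when(
      similarity == 1   ~ "exact",
      similarity >= 0.7 ~ "close",
      TRUE              ~ "different"
    )
  )

print(test_2)
```

If differences in capitalization should not count, wrapping both columns in `tolower()` before computing the distance is a common tweak.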