
I got stuck on some string similarity issues.

This is what my data looks like (the original data is huge):

SerialNumber                SubSerialID            Date

AGCC0775CFNDA1040TMT775     AVCC0775CFNDA1040     2018/01/08
AGCC0775CFNDA1040           AVCC0775CFNDA1040     2015/12/28
AGCC0775CFNDA10407EC        AVCC0775CFNDA1040     2018/03/17
CH~MT765E~C0765HFNCC1056    BGDC0865HFNKG1043     2019/01/07
2658358                     BGDC0865HFNKG1043     2018/08/09
MT765E~C0765KFNCD1044       C0765KFNCD10          2015/04/07
187A126                     C0765KFNCD10          2017/11/30

...

My target is:

SerialNumber                SubSerialID            Date 

AGCC0775CFNDA10407EC        AVCC0775CFNDA1040     2018/03/17
CH~MT765E~C0765HFNCC1056    BGDC0865HFNKG1043     2019/01/07
2658358                     BGDC0865HFNKG1043     2018/08/09
MT765E~C0765KFNCD1044       C0765KFNCD10          2015/04/07
187A126                     C0765KFNCD10          2017/11/30

...

Serial Numbers AGCC0775CFNDA1040TMT775, AGCC0775CFNDA1040, and AGCC0775CFNDA10407EC refer to the same item; the variants are data-entry mistakes. I want to keep AGCC0775CFNDA10407EC because it has the latest date on record. However, I cannot use SubSerialID and Date alone to filter those Serial Numbers, because that would also remove 2658358.

I thought about using stringdist to compute string similarity as an additional condition (e.g., filter out rows where the distance falls outside some threshold) but cannot figure out an efficient way to do it. The data is huge, so a for loop is impractical. I've been stuck for a while, and hopefully someone can give me some advice or suggestions on this matter.


1 Answer


The following reproduces your expected output:

library(dplyr)
library(purrr)
df %>%
    mutate(Date = as.Date(Date)) %>%          # parse "yyyy/mm/dd" strings as Dates
    mutate_if(is.factor, as.character) %>%    # read.table may have returned factors
    # Levenshtein distance between SerialNumber and SubSerialID, row by row
    mutate(dist = map2_dbl(SerialNumber, SubSerialID, adist)) %>%
    group_by(SubSerialID) %>%
    # keep every row if no SerialNumber is close to the SubSerialID;
    # otherwise keep only the row with the latest Date
    filter(all(dist > 5) | Date == max(Date)) %>%
    ungroup()
## A tibble: 5 x 4
#  SerialNumber             SubSerialID       Date        dist
#  <chr>                    <chr>             <date>     <dbl>
#1 AGCC0775CFNDA10407EC     AVCC0775CFNDA1040 2018-03-17     4
#2 CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019-01-07    15
#3 2658358                  BGDC0865HFNKG1043 2018-08-09    15
#4 MT765E~C0765KFNCD1044    C0765KFNCD10      2015-04-07     9
#5 187A126                  C0765KFNCD10      2017-11-30    11

The idea is to keep all entries (per SubSerialID) if all Levenshtein distances between SubSerialID and SerialNumber are greater than 5. If at least one distance is <= 5, only keep the row with the largest Date. I've kept the dist column for debugging; you can remove it with select(-dist).

I'm not sure how generalisable this is. You will have to play around with the Levenshtein distance threshold (which I set to 5 in this case).
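Since you mention the data is huge: as an aside, the per-row map2_dbl(..., adist) call can be replaced with a single vectorised call to stringdist::stringdist, which computes all pairwise distances at once without R-level looping. This is a sketch assuming the stringdist package is installed; the threshold of 5 is the same assumption as above.

```r
library(dplyr)
library(stringdist)

df %>%
    mutate(Date = as.Date(Date)) %>%
    mutate_if(is.factor, as.character) %>%
    # vectorised Levenshtein distance over the two columns in one call
    mutate(dist = stringdist(SerialNumber, SubSerialID, method = "lv")) %>%
    group_by(SubSerialID) %>%
    filter(all(dist > 5) | Date == max(Date)) %>%
    ungroup()
```

This should give the same result as the purrr version, just faster on large inputs.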


Sample data

df <- read.table(text =
"SerialNumber                SubSerialID            Date

AGCC0775CFNDA1040TMT775     AVCC0775CFNDA1040     2018/01/08
AGCC0775CFNDA1040           AVCC0775CFNDA1040     2015/12/28
AGCC0775CFNDA10407EC        AVCC0775CFNDA1040     2018/03/17
CH~MT765E~C0765HFNCC1056    BGDC0865HFNKG1043     2019/01/07
2658358                     BGDC0865HFNKG1043     2018/08/09
MT765E~C0765KFNCD1044       C0765KFNCD10          2015/04/07
187A126                     C0765KFNCD10          2017/11/30", header = TRUE)
Maurits Evers
    Just want to thank you again for showing me this smart idea. I found some special cases and use the same tricks to add one more restriction to find the perfect results I want! Truly learned a lot from you. @Maurits Evers – mimibao1009 Feb 18 '20 at 15:27