R: Join two data frames via a character column that has small misspellings

Question

My two data frames have the same character column. It would be easy to join them by this column using dplyr::full_joint. But the problem is that the common column has slight but obvious differences in spelling. The spelling differences are small relative to each string defining the skill:

Skill                   Grade_Judge_A

pack & ship               1
pack & store              5
sell                      3
Design a room             9


Skill                   Grade_Judge_B

pack and store            3
pack & ship               7
sell                      2
Design room               6

How to achieve the desired output below:

Skill                   Grade_Judge_A      Grade_Judge_B

pack & ship               1                     3                
pack & store              5                     7
sell                      3                     2
Design a room             9                     6

I was thinking matching rows in the two dataframes based on distance between strings in "Skill" columns, for example using stringdist package. If the difference between two strings is small, then it would mean that the skill is the same.

I'd prefer a dplyr/tidyverse solution.

Here is an actual dput for dataframe A:

> dput(df_A)

structure(list(skill = c(" [Assess abdomen for a floating mass]", 
" [Assess Nerve Root Compression]", " [Evaluate breathing effort (rate, patterns, chest expansions)]", 
" [Evaluate Plantar Reflex/Babinski sign]", " [Evaluate Speech]", 
" [External palpation of a uterus]", " [Heel to Shin test]", 
" [Inspect anterior chamber of eye with ophthalmoscope or penlight]", 
" [Inspect breast]", " [Inspect Overall Skin Color/Tone]", " [Inspect Skin Lesions]", 
" [Inspect Wounds]", " [Mental Status/level of consciousness]", 
" [Nose/index finger]", " [Percuss abdomen to determine spleen size]", 
" [Percuss costovertebral angle for kidney tenderness]", " [Percuss for diaphragmatic excursion]", 
" [Percuss the abdomen for abdominal tones]", " [Percuss the abdomen to determine liver span]"
), `2016-09-17 13:41:08` = c(1, 1, 5, 3, 4, 0, 4, 3, 3, 5, 4, 
5, 5, 3, 1, 1, 2, 4, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-17 13:41:08"
))

and dataframe B:

> dput(df_B)

structure(list(skill = c(" [Assess abdomen for floating mass]", 
" [Assess nerve root compression]", " [Evaluate breathing effort (rate, patterns, chest expansion)]", 
" [Evaluate plantar reflex/Babinski sign]", " [Evaluate speech]", 
" [External palpation of uterus]", " [Heel to shin test]", " [Inspect anterior chamber of the eye with opthalmoscope or penlight]", 
" [Inspect breasts]", " [Inspect overall skin color/tone]", " [Inspect skin lesions]", 
" [Inspect wounds]", " [Mental status/level of consciousness]", 
" [Nose/Index finger]", " [Percuss costovertebral angle for kidney tenderness]", 
" [Percuss for diaphragmatic excursion]", " [Percuss the abdomen for abdominal tones]", 
" [Percuss the abdomen to determine liver span]", " [Percuss the abdomen to determine spleen size]"
), `2016-09-21 07:58:43` = c(0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-21 07:58:43"
))

Here are the heads of both dataframes:

 > head(df_A)
        # A tibble: 6 × 2
                                                                    skill `2016-09-17 13:41:08`
                                                                    <chr>                 <dbl>
        1                            [Assess abdomen for a floating mass]                     1
        2                                 [Assess Nerve Root Compression]                     1
        3  [Evaluate breathing effort (rate, patterns, chest expansions)]                     5
        4                         [Evaluate Plantar Reflex/Babinski sign]                     3
        5                                               [Evaluate Speech]                     4
        6                                [External palpation of a uterus]                     0

and the second one:

> head(df_B)
# A tibble: 6 × 2
                                                           skill `2016-09-21 07:58:43`
                                                           <chr>                 <dbl>
1                             [Assess abdomen for floating mass]                     0
2                                [Assess nerve root compression]                     2
3  [Evaluate breathing effort (rate, patterns, chest expansion)]                     2
4                        [Evaluate plantar reflex/Babinski sign]                     2
5                                              [Evaluate speech]                     2
6                                 [External palpation of uterus]                     1

You can address systematic differences before the merge (for example, replace "&" with "and" and convert all strings to lower case). But are there random misspellings as well? — eipi10, Nov 14 '16 at 00:41
Also, `full_joint` might be appropriate if you're seeking to merge into the universal consciousness, but `full_join` will be more effective when applied to data. — eipi10, Nov 14 '16 at 00:42
It is not known what type of misspellings can be present. All is known that they are small relative to the length of the string. A string defining each skill is 20 to 60 characters long but differences are small such as missing articles. Can one use stringdist to pair the skills? If stringdist is small, then it is assumed to be the same skill. — Irakli, Nov 14 '16 at 00:45
Yes, `stringdist` might be able to take care of most issues, though you might run into some cases where correctly spelled skills are also similar to each other. Can you provide a larger set of examples of your data? Use `dput` to provide the sample data. — eipi10, Nov 14 '16 at 00:49
https://cran.r-project.org/web/packages/fuzzyjoin/index.html might be relevant here. — thelatemail, Nov 14 '16 at 00:52
Awsom @thelatemail . I was looking for something like that since last year! — MFR, Nov 14 '16 at 00:57
Yes, those are helpful to compare two strings, but how to do for dataframes? Note that you cannot just compare row vs row, because misspellings mess up alphabetical sorting — Irakli, Nov 14 '16 at 01:05

score 1 · Answer 1 · answered Nov 09 '17 at 23:03

How close does this get?

require(tidyverse)
require(stringdist)

df_A %>%
    rownames_to_column %>%
    mutate(foo=1) %>%
    full_join((df_B %>% rownames_to_column %>% mutate(foo=1)), by='foo') %>%
    select(-foo) %>%
    mutate(dist = stringdist(skill.x, skill.y), norm_dist = dist / length(skill.x)) %>%
    arrange(norm_dist) %>%
    filter(norm_dist < 0.015)

I do a true (relational-algebra-style) full join on df_A and df_B, which will blow up your data if the real data you have is large (e.g. if both data frames have 1000 rows the result of the join will bee 1,000,000 rows). This join is done by creating a dummy column foo that is equal for every row and then joining on the dummy column.

The stringdist package, mentioned in the comments, then compares row A to row B for every possible combination of the two strings. For your example data a cutoff of 0.015 of the normalized string distance, results seemed good. Of course this arbitrary cutoff my be overfit to your example data.

From the examples, this could be made hugely more efficient by only joining when the strings start with the same letter (if additional things are known about the errors then modify as appropriate). `mutate(foo = tolower(substr(skill, 1, 1)))` — Moohan, Aug 12 '19 at 15:30

score 0 · Answer 2 · answered Nov 14 '16 at 00:22

0

If there is no pattern in the spelling mistake , I believe the only remaining way is making sure that spellings are same before joining the data. We can use splitstackshape package

library(splitstackshape)

yourdata$skill<-stri_replace_all(yourdata8$skill,"pack & store" ,fixed = "pack and store")

This code replace pack and store with pack & store in skill column of your dataset

answered Nov 14 '16 at 00:22

MFR

2,049
3
29
53

Skills are long character strings but misspellings are random and small (such as Upper instead of lower case or an article missing). What kind of misspellings are present is not known. All is known is that misspellings are small relative to the string length. – Irakli Nov 14 '16 at 00:25
No sure, Perhaps you want to see the differences between the skill column and bsed on these differences label them http://stackoverflow.com/questions/28834459/extract-characters-that-differ-between-two-strings – MFR Nov 14 '16 at 00:45
I think stringdist will work better because it tells the numeric "distance" between two strings. I am just not sure how to implement it for dataframes – Irakli Nov 14 '16 at 00:49
according their documentation you can see if the differences are more than a threshold or not. This is one example that tell you if the difference is more than 1 or not. `amatch("leia",c("uhura","leela"),maxDist=1)`. If you provide a reproducible example I'll try – MFR Nov 14 '16 at 00:55
I added dput of a sample of rows of from both dataframes – Irakli Nov 14 '16 at 01:03
Yes, a good way to set up sensitivity is the ratio of the distance between two strings over the string lengths. – Irakli Nov 14 '16 at 01:14
I couldn't produce the desired output, but definitely I'll be back here and try based on the suggestions in this post. This question is my question too,and normally I replace data which is tiring. My life will be easier if I could solve this problem. – MFR Nov 14 '16 at 01:39

R: Join two data frames via a character column that has small misspellings

2 Answers2