My two data frames have the same character column. It would be easy to join them by this column using dplyr::full_joint. But the problem is that the common column has slight but obvious differences in spelling. The spelling differences are small relative to each string defining the skill:
Skill Grade_Judge_A
pack & ship 1
pack & store 5
sell 3
Design a room 9
Skill Grade_Judge_B
pack and store 3
pack & ship 7
sell 2
Design room 6
How to achieve the desired output below:
Skill Grade_Judge_A Grade_Judge_B
pack & ship 1 3
pack & store 5 7
sell 3 2
Design a room 9 6
I was thinking matching rows in the two dataframes based on distance between strings in "Skill" columns, for example using stringdist package. If the difference between two strings is small, then it would mean that the skill is the same.
I'd prefer a dplyr/tidyverse solution.
Here is an actual dput for dataframe A:
> dput(df_A)
structure(list(skill = c(" [Assess abdomen for a floating mass]",
" [Assess Nerve Root Compression]", " [Evaluate breathing effort (rate, patterns, chest expansions)]",
" [Evaluate Plantar Reflex/Babinski sign]", " [Evaluate Speech]",
" [External palpation of a uterus]", " [Heel to Shin test]",
" [Inspect anterior chamber of eye with ophthalmoscope or penlight]",
" [Inspect breast]", " [Inspect Overall Skin Color/Tone]", " [Inspect Skin Lesions]",
" [Inspect Wounds]", " [Mental Status/level of consciousness]",
" [Nose/index finger]", " [Percuss abdomen to determine spleen size]",
" [Percuss costovertebral angle for kidney tenderness]", " [Percuss for diaphragmatic excursion]",
" [Percuss the abdomen for abdominal tones]", " [Percuss the abdomen to determine liver span]"
), `2016-09-17 13:41:08` = c(1, 1, 5, 3, 4, 0, 4, 3, 3, 5, 4,
5, 5, 3, 1, 1, 2, 4, 1)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-17 13:41:08"
))
and dataframe B:
> dput(df_B)
structure(list(skill = c(" [Assess abdomen for floating mass]",
" [Assess nerve root compression]", " [Evaluate breathing effort (rate, patterns, chest expansion)]",
" [Evaluate plantar reflex/Babinski sign]", " [Evaluate speech]",
" [External palpation of uterus]", " [Heel to shin test]", " [Inspect anterior chamber of the eye with opthalmoscope or penlight]",
" [Inspect breasts]", " [Inspect overall skin color/tone]", " [Inspect skin lesions]",
" [Inspect wounds]", " [Mental status/level of consciousness]",
" [Nose/Index finger]", " [Percuss costovertebral angle for kidney tenderness]",
" [Percuss for diaphragmatic excursion]", " [Percuss the abdomen for abdominal tones]",
" [Percuss the abdomen to determine liver span]", " [Percuss the abdomen to determine spleen size]"
), `2016-09-21 07:58:43` = c(0, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -19L), .Names = c("skill", "2016-09-21 07:58:43"
))
Here are the heads of both dataframes:
> head(df_A)
# A tibble: 6 × 2
skill `2016-09-17 13:41:08`
<chr> <dbl>
1 [Assess abdomen for a floating mass] 1
2 [Assess Nerve Root Compression] 1
3 [Evaluate breathing effort (rate, patterns, chest expansions)] 5
4 [Evaluate Plantar Reflex/Babinski sign] 3
5 [Evaluate Speech] 4
6 [External palpation of a uterus] 0
and the second one:
> head(df_B)
# A tibble: 6 × 2
skill `2016-09-21 07:58:43`
<chr> <dbl>
1 [Assess abdomen for floating mass] 0
2 [Assess nerve root compression] 2
3 [Evaluate breathing effort (rate, patterns, chest expansion)] 2
4 [Evaluate plantar reflex/Babinski sign] 2
5 [Evaluate speech] 2
6 [External palpation of uterus] 1