Partial matching of two columns of data

Question

My Problem is that I have a column of data (characters & numbers combined) and I want to check for each field if it partially matches with one of the fields in a column of a different df, and if it does write the corresponding value from the second df into a new column of the first variable.

So more practically:

df1 (2x60k)=

QST_ID    QST_CODING
1         M79.68
2         A01.1
3         K07
4         Z09

df2 (2x2451)=

icd    name
A01    intestinal diseases
...
K07    Diseases of oral cavity

And the aim would be to have df1 =

QST_ID    QST_CODING    QST_CODING_RC_NAME
1         M79.68        diseaseX
2         A01.1         Intestinal diseases
3         K07           Diseases of oral cavity
4         Z09           diseaseY

The easiest fix would be to ignore everything after the . and basically treat the 'icd' as icd* But I don't get this to work for the whole list.

I tried doing this with the match, pmatch and grep functions and tried prefixing all icd with ^ (regex wildcat) but it didn't work at all:

df1$QST_CODING_RC_name <- df2$name[pmatch(as.character(df1$QST_CODING), df2$icd, duplicates.ok = T)]

df1$QST_CODING_RC_name <- df2$name[grep(df2$icd, as.character(df1$QST_CODING), ignore.case = T)]

I would be very happy if anyone has a solution for this. Thanks a lot for your help.

Edit: QST_CODING contains values of a different form(g.523, etc.), where the information after the point is necessary to work with later and cannot simply be dropped. Furthermore, the length of bothe the part in front of the point as well as the part after it differs in QST_CODING.

First use `gsub("\\..*", "", df1$QST_CODING)` or similar to drop all after the `.`, then `merge(df1, df2, by.x="QST_CODING", by.y="icd", all.x=TRUE)`. — r2evans, Sep 13 '18 at 22:35
Correction: `by.x="QST_abbreviated"`, that being the column created by the `gsub` command. — r2evans, Sep 13 '18 at 22:42
This is asking for partial matching. Might be a duplicate, but probably not for the nominated Q. Search on: "merge partial match [r]". If the question was "How to match on the first three characters of one column with another column then perhaps an exact match answer would be meaningful. — IRTFM, Sep 13 '18 at 23:43
Thanks a lot for the help. Unfortunately, this doesn't work however because QST_CODING also contains values such as g.235, K-325.65, etc. And because these are a different code, i need them to stay as they are and recode them later. And for the solution of just matching the first x characters: QST_CODING has different lengths (before the point a- well as in total). — LTribe, Sep 14 '18 at 04:56
This question has a specific answer relating to ICD codes, so perhaps not a duplicate. Using the icd CRAN package (I'm the author), there are tried and tested algorithms for doing this. I think what you are doing is assigning comorbidities. icd can do this with standardized disease groups, or you can use your own. All this is well documented. https://jackwasey.github.io/icd/ — Jack Wasey, Sep 16 '18 at 12:05

score 0 · Answer 1 · answered Sep 14 '18 at 08:42

You could try using the fuzzyjoin package with the stringdist_joins. I used the method = "soundex", but method = "qgram" gives in this example the same results. On your full dataset you might have a better result using qgrams if the icd codes are always a part of the QST_CODING field, but that is something you have to investigate.

library(fuzzyjoin)

stringdist_left_join(df, df, by = c("QST_CODING" = "icd"), method = "soundex")

  QST_ID QST_CODING  icd                    name
1      1     M79.68 <NA>                    <NA>
2      2      A01.1  A01     intestinal diseases
3      3        K07  K07 Diseases of oral cavity
4      4        Z09 <NA>                    <NA>

Partial matching of two columns of data

1 Answers1