I have two Excel sheets with insurance claims data from two different insurance providers. I need to find individuals who have filed claims with both providers.
I would like something that pairs names when they are likely to be the same person, but does nothing when it cannot find a similar enough name in the other sheet. From what I have read, I think I need fuzzy string matching for this (and maybe the Damerau-Levenshtein distance). I know R has a string distance function, adist, but I am struggling to learn how to use it properly.
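To make the goal concrete, here is a rough sketch of how I understand adist could be used for this. As far as I can tell from the documentation, adist computes a generalized Levenshtein distance and returns a matrix with one row per name in the first vector and one column per name in the second. The variable names and the 0.5 cutoff are my own assumptions, not something from a reference, and the cutoff would need tuning on real data.

```r
# Names copied from the example in this question
p1 <- c("Ms. Smith", "Adam Jacobs", "Emily Lo", "Frances Wu")
p2 <- c("Clara Smith", "Bill White", "Emily S. Lo", "Dev Patel")

# adist() returns a length(p1) x length(p2) matrix of edit distances
d <- adist(p1, p2, ignore.case = TRUE)

# For every name in p1, the index and distance of its closest name in p2
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(p1), best)]

# Scale by the longer name so short and long names are comparable
rel <- best_dist / pmax(nchar(p1), nchar(p2[best]))

max_rel <- 0.5  # hypothetical cutoff, chosen only for this toy example

matched <- data.frame(name_p1 = p1, name_p2 = p2[best],
                      dist = best_dist, rel = rel,
                      stringsAsFactors = FALSE)

# Names without a close enough partner are simply dropped
matched <- matched[matched$rel <= max_rel, ]
print(matched)
```

Dividing by the length of the longer name is just one way to normalize; a fixed maximum distance would also work but treats "Lo" and "Frances Wu" the same, which seems too strict for long names and too loose for short ones.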
For example:
Provider 1:
Ms. Smith 35 F Portland,OR Cardiac
Adam Jacobs 27 M San Francisco, CA Gynecology
Emily Lo 19 F Portland,OR Ortho
Frances Wu 33 F Dallas, TX ENT
Provider 2:
Clara Smith 35 F Portland,OR Cardiac
Bill White 29 M San Francisco, CA Ortho
Emily S. Lo 19 F Portland,OR Ortho
Dev Patel 22 M Dallas, TX Neuro
So here it should recognize that Emily S. Lo is the same person as Emily Lo, and that Clara Smith is the same person as Ms. Smith, and give me a list with their names and information. How do I do this?
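Here is the example as reproducible data frames, together with a sketch of the kind of output I am after. The column names (name, age, sex, city, dept) are labels I made up for this post. The idea of first requiring an exact match on age, sex and city, and of treating the last word of the name as the surname, are assumptions that happen to work on this toy data and may not hold for the real sheets.

```r
provider1 <- data.frame(
  name = c("Ms. Smith", "Adam Jacobs", "Emily Lo", "Frances Wu"),
  age  = c(35, 27, 19, 33),
  sex  = c("F", "M", "F", "F"),
  city = c("Portland,OR", "San Francisco, CA", "Portland,OR", "Dallas, TX"),
  dept = c("Cardiac", "Gynecology", "Ortho", "ENT"),
  stringsAsFactors = FALSE
)
provider2 <- data.frame(
  name = c("Clara Smith", "Bill White", "Emily S. Lo", "Dev Patel"),
  age  = c(35, 29, 19, 22),
  sex  = c("F", "M", "F", "M"),
  city = c("Portland,OR", "San Francisco, CA", "Portland,OR", "Dallas, TX"),
  dept = c("Cardiac", "Ortho", "Ortho", "Neuro"),
  stringsAsFactors = FALSE
)

# Candidate pairs: records that agree exactly on age, sex and city
cand <- merge(provider1, provider2,
              by = c("age", "sex", "city"),
              suffixes = c(".p1", ".p2"))

# Assumption: the last whitespace-separated word is the surname
surname <- function(x) sub(".* ", "", x)
s1 <- surname(cand$name.p1)
s2 <- surname(cand$name.p2)

# Edit distance between the two surnames of each candidate pair
cand$surname_dist <- vapply(
  seq_len(nrow(cand)),
  function(i) adist(s1[i], s2[i], ignore.case = TRUE)[1, 1],
  numeric(1)
)

max_dist <- 1  # hypothetical tolerance for typos in the surname
result <- cand[cand$surname_dist <= max_dist, ]
print(result)
```

On this data the result should contain only the Ms. Smith / Clara Smith and Emily Lo / Emily S. Lo pairs, with the information from both sheets side by side.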
I tried copying what this person did: http://bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/ I used their data and copied and pasted their code, but I keep getting a 0x0 result.