I have a dataset with 6 million court proceedings from 59 different immigration courts. Each record includes among other things an attorney code. However there are multiple codes associated with each unique attorney. And in another table that associates the attorney codes with attorney names, the attorney names are listed in a single string in every imaginable permutation of firstName, lastName, middleName, Esq, etc. The attorney names also include misspellings. I believe there are upwards of 10,000 unique attorneys.
Here is an example of the unique codes/name pairs associated with a single attorney whose true (disguised-here) name should be "H. Todd Smith" (first column is row numbers):
attorneycode attorneyname
16 SEH H. TODD SMITH, ESQ.
26 HTS H. TODD SMITH, ESQ.
23 HTS SMITH ESQ., H. TODD
9 HTS SMITH, ESQ., H. TODD
18 SES SMITH, ESQ., TODD H.
7 SES SMITH, H. TODD
12 S02 SMITH, H. TODD
1 Q<B Smith, H. Todd, Esq.
2 SHT Smith, H. Todd, Esq.
3 ..T Smith, H. Todd, Esq.
5 HTS Smith, H. Todd, Esq.
8 SHT SMITH, H. TODD, ESQ.
21 SH0 SMITH, H. TODD, ESQ.
20 ses smith, todd
11 SES SMITH, TODD H.
6 SSM SMITH, TODD H., ESQ.
4 SES Todd H. Smith, Esq.
17 THM TODD H. SMITH, ESQ.
My goal is to assign a unique ID to each unique attorney and all his/her current associated codes/names. But there is no listing of "true" names against which I can "match". How can I generate a list of the best candidates for unique attorney names?
EDIT
Richie Cotton suggests this question is answered in another post -- but the other post deals with "duplicate" records whereas in my data these are not duplicates, but rather unique records that appear to be related to the same person, though not coded as such. Moreover the primary suggested answer in the linked post involves using a package that is no longer available on CRAN. Am I misunderstanding something?