0

I have a dataset with 6 million court proceedings from 59 different immigration courts. Each record includes among other things an attorney code. However there are multiple codes associated with each unique attorney. And in another table that associates the attorney codes with attorney names, the attorney names are listed in a single string in every imaginable permutation of firstName, lastName, middleName, Esq, etc. The attorney names also include misspellings. I believe there are upwards of 10,000 unique attorneys.

Here is an example of the unique codes/name pairs associated with a single attorney whose true (disguised-here) name should be "H. Todd Smith" (first column is row numbers):

   attorneycode         attorneyname
16          SEH  H. TODD SMITH, ESQ.
26          HTS  H. TODD SMITH, ESQ.
23          HTS  SMITH ESQ., H. TODD
9           HTS SMITH, ESQ., H. TODD
18          SES SMITH, ESQ., TODD H.
7           SES       SMITH, H. TODD
12          S02       SMITH, H. TODD
1           Q<B Smith, H. Todd, Esq.
2           SHT Smith, H. Todd, Esq.
3           ..T Smith, H. Todd, Esq.
5           HTS Smith, H. Todd, Esq.
8           SHT SMITH, H. TODD, ESQ.
21          SH0 SMITH, H. TODD, ESQ.
20          ses          smith, todd
11          SES       SMITH, TODD H.
6           SSM SMITH, TODD H., ESQ.
4           SES  Todd H. Smith, Esq.
17          THM  TODD H. SMITH, ESQ.

My goal is to assign a unique ID to each unique attorney and all his/her current associated codes/names. But there is no listing of "true" names against which I can "match". How can I generate a list of the best candidates for unique attorney names?

EDIT

Richie Cotton suggests this question is answered in another post -- but the other post deals with "duplicate" records whereas in my data these are not duplicates, but rather unique records that appear to be related to the same person, though not coded as such. Moreover the primary suggested answer in the linked post involves using a package that is no longer available on CRAN. Am I misunderstanding something?

kmayeaux
  • 67
  • 4
  • Do we have any guarantee that Todd Smith is really Todd H. Smith? Are you expecting this method to resolve such ambiguities? Does this "attorney code" have any value in this effort? – IRTFM Jul 12 '14 at 23:31
  • If you also had a variable for "court appeared in", you might be able to go further to distinguish attorneys. My second quick thought was to lowercase the variable and to remove "Esq." if it appears. Beyond that, agrep() doesn't help because you can't specify a pattern. I do not know enough about adist() or the dissimilarity function in the tm package. – lawyeR Jul 13 '14 at 01:13
  • There is also an associated court city code for each proceeding. And the attorney table also includes the attorney's address (or addresses as the case may be) as well as an associated court code. Both the court code and attorney code may assist in distinguishing between attorneys. – kmayeaux Jul 13 '14 at 03:35
  • I don't know how to code this, but let's assume the attorney addresses are unique (or much closer to being unique than the name set). Then, for each presumed unique address, pick the attorney name that is in the format "^[:alpha:], .... That is, Smith, ... and copy that to a new column called "Assumed Name". That would move you farther along, although all kinds of problems might then surface. – lawyeR Jul 14 '14 at 00:02

0 Answers0