I have a pandas dataframe of roughly 1 million rows, each row holding data parsed from a federal appellate court opinion. I need to extract the names of the judges who heard each case. Each case (one row) has an unknown number of judges, contained in a single string column; that string includes a lot of excess text and has inconsistent formatting and capitalization. To match the judges listed, I use several dictionaries of judge names (with 2,575 possible regex keys), based on multiple criteria described below: I apply the dictionary with the most stringent matching criteria first and gradually loosen the criteria, and I also remove each matched substring from the source column. The methods I have tried so far are simply too slow (taking days, weeks, or even months). Multiple dictionaries are needed because many judges share the same last name, and the strings don't ordinarily include full names. To pick the right match, I use two other integer columns: the year the case was decided and the court that heard it. I also have higher- and lower-quality substring search terms. The dictionaries can be recreated at will in formats other than regex if needed.
The fastest solution I have tried was crude and unpythonic. In the initial parsing of the data (extraction of sections and keywords from raw text files), which occurs case by case, I did the following:

1. removed excess text to the degree possible,
2. sorted the remaining text into a list stored in a pandas column,
3. concatenated the year and court (as strings) onto each item in that list, and
4. matched each concatenated string against a dictionary I had prepared the same way.

That dictionary didn't use regular expressions and had approximately 800,000 keys. The whole process took about a day (including all of the other parsing) and was not as accurate as I would have liked, because it omitted certain name-format permutations. A rough sketch of that approach appears just below; the code after it contains my most recent attempt, which is currently running and looks to be among the slowest options yet. It creates subset dictionaries on the fly and still ends up iterating through those smaller dictionaries with regex keys. I've read through and tried to apply solutions from many Stack Overflow questions but couldn't find a workable one. I'm open to any Python-based idea. The data is real data that I've cleaned with a prior function.
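For illustration, here is a minimal sketch of that earlier concatenated-key approach; the key format, dictionary contents, and helper name are made up for this example, not taken from my actual parsing code:

# Illustrative only: keys are 'NAME|court|year' strings mapping to a
# canonical full name (my real dictionary had ~800,000 such keys covering
# name permutations, courts, and active years).
exact_lookup = {
    'CHAGARES|3|2014': 'Chagares, Michael A.',
    'VANASKIE|3|2014': 'Vanaskie, Thomas Ignatius',
    'SCHWARTZ|9|1985': 'Schwartz, Edward Joseph',
}

def match_panel(tokens, court, year):
    # tokens: cleaned, uppercased name fragments parsed from one opinion
    keys = ['{}|{}|{}'.format(t, court, year) for t in tokens]
    return sorted({exact_lookup[k] for k in keys if k in exact_lookup})

match_panel(['CHAGARES', 'VANASKIE', 'SCHWARTZ'], 3, 2014)
# -> ['Chagares, Michael A.', 'Vanaskie, Thomas Ignatius']

Because this is a plain dict lookup per token, it is fast, but any name permutation not baked into the keys is simply missed, which is the accuracy problem I mentioned.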
import numpy as np
import pandas as pd
test_data = {'panel_judges' : ['CHAGARES, VANASKIE, SCHWARTZ',
                               'Sidney R. Thomas, Barry G. Silverman, Raymond C. Fisher, Opinion by Thomas'],
             'court_num' : [3, 9],
             'date_year' : [2014, 2014]}
test_df = pd.DataFrame(data = test_data)
name_dict = {'full_name' : ['Chagares, Michael A.',
                            'Vanaskie, Thomas Ignatius',
                            'Schwartz, Charles, Jr.',
                            'Schwartz, Edward Joseph',
                            'Schwartz, Milton Lewis',
                            'Schwartz, Murray Merle'],
             'court_num' : [3, 3, 1061, 1097, 1058, 1013],
             'circuit_num' : [3, 3, 5, 9, 9, 3],
             'start_year' : [2006, 2010, 1976, 1968, 1979, 1974],
             'end_year' : [2016, 2019, 2012, 2000, 2005, 2013],
             # high-quality patterns: required first initial, optional rest of
             # the first/middle name in non-capturing (?:...) groups, optional
             # middle initial and period, then the surname
             'hq_match' : [r'M(?:ICHAEL)? ?A?\.? ?CHAGARES',
                           r'T(?:HOMAS)? ?I?(?:GNATIUS)? ?VANASKIE',
                           r'C(?:HARLES)? SCHWARTZ',
                           r'E(?:DWARD)? ?J?(?:OSEPH)? ?SCHWARTZ',
                           r'M(?:ILTON)? ?L?(?:EWIS)? ?SCHWARTZ',
                           r'M(?:URRAY)? ?M?(?:ERLE)? ?SCHWARTZ'],
             # low-quality patterns: bare surname only
             'lq_match' : ['CHAGARES',
                           'VANASKIE',
                           'SCHWARTZ',
                           'SCHWARTZ',
                           'SCHWARTZ',
                           'SCHWARTZ']}
names = pd.DataFrame(data = name_dict)
in_col = 'panel_judges'
year_col = 'date_year'
out_col = 'fixed_panel'
court_num_col = 'court_num'
test_df[out_col] = ''   # matcher() builds this up as a comma-separated string
def judge_matcher(df, in_col, out_col, year_col, court_num_col):
    """Match judges tier by tier: high-quality patterns before low-quality
    ones, and exact court numbers before circuit numbers."""
    general_cols = ['start_year', 'end_year', 'full_name']
    court_cols = ['court_num', 'circuit_num']
    match_cols = ['hq_match', 'lq_match']
    for match_col in match_cols:
        for court_col in court_cols:
            lookup_cols = general_cols + [court_col] + [match_col]
            judge_df = names[lookup_cols]
            # build an on-the-fly subset dictionary for every year/court pair
            for year in range(df[year_col].min(), df[year_col].max() + 1):
                for court in range(df[court_num_col].min(),
                                   df[court_num_col].max() + 1):
                    # judges active that year (allowing one year past their
                    # end year) on that court
                    lookup_subset = ((judge_df['start_year'] <= year)
                                     & (year < (judge_df['end_year'] + 2))
                                     & (judge_df[court_col] == court))
                    new_names = names.loc[lookup_subset]
                    df_subset = ((df[year_col] == year)
                                 & (df[court_num_col] == court))
                    df.loc[df_subset] = matcher(df.loc[df_subset].copy(),
                                                in_col, out_col,
                                                new_names, match_col)
    return df
def matcher(df, in_col, out_col, lookup, keys):
    # map each regex pattern to the canonical full name it identifies
    patterns = dict(zip(lookup[keys], lookup['full_name']))
    for key, value in patterns.items():
        upper = df[in_col].astype(str).str.upper()
        # append the canonical name wherever the pattern matches ...
        df[out_col] = np.where(upper.str.contains(key, regex = True),
                               df[out_col] + value + ', ', df[out_col])
        # ... and strip the matched text so looser tiers can't re-match it
        df[in_col] = upper.str.replace(key, '', regex = True)
    return df
df = judge_matcher(test_df, in_col, out_col, year_col, court_num_col)
The output I currently get is essentially right (although the names should be sorted and in a list): the proper "Schwartz" is picked and the matches are all correct. The problem is speed. My goal is a de-duplicated, alphabetically sorted list of the judges on each panel, either stored in a single column or exploded into up to 15 separate columns (I presently do the latter in a separate vectorized function; a simplified sketch is below). I will then do further lookups on those judges based on other demographic and biographical information. The resulting data will be openly available to researchers in the area, and the code will be part of a free, publicly available platform usable for studying other courts as well, so accuracy and speed are both important considerations for users on many different machines.
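For concreteness, here is a simplified sketch of that post-processing step. It is not my actual function, and it assumes the names were joined with '; ' rather than the ', ' used by matcher() above, since the full names themselves contain commas:

import pandas as pd

def tidy_panels(df, panel_col):
    # split the accumulated string, drop duplicates, sort alphabetically
    # (assumes '; ' was the separator, which cannot appear inside a name)
    panels = (df[panel_col].str.rstrip('; ')
                           .str.split('; ')
                           .apply(lambda names: sorted({n for n in names if n})))
    df[panel_col] = panels
    # explode into up to 15 judge_N columns, padding short panels with NaN
    wide = pd.DataFrame(panels.tolist(), index = df.index).iloc[:, :15]
    wide.columns = ['judge_{}'.format(i + 1) for i in range(wide.shape[1])]
    return df.join(wide)

This step is already fast for me; the bottleneck is the matching loop above it.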