
I have a pandas dataframe of roughly 1 million rows containing data parsed from federal appellate court opinions, and I need to extract the names of the judges hearing each case. Each case (one row) has an unknown number of judges, which are contained in a single string column along with a lot of excess text and inconsistent formatting and capitalization. I use several dictionaries of judge names (with up to 2,575 regex keys) to match the judges listed, based on the criteria described below: I apply the dictionary with the most stringent matching criteria first, gradually loosen the criteria, and remove each matched string from the source column as I go. The methods I have tried so far are simply too slow (taking days, weeks, or even months). The reason there are multiple possible dictionaries is that many judges share the same (last) name, and the strings don't ordinarily include full names, so I use two other integer columns, the year the case was decided and the court hearing the case, to get the right match. I also have higher- and lower-quality substring search terms. The dictionaries can be recreated at will in formats other than regex if needed.

The fastest solution I have tried was crude and unpythonic. During the initial parsing of the data (extraction of sections and keywords from the raw text files), which happens case by case, I: 1) removed excess text as far as possible, 2) sorted the remaining text into a list stored in a pandas column, 3) concatenated the year and court as strings onto each item in that list, and 4) matched each concatenated string against a dictionary I had prepared the same way (a rough sketch of that approach appears just below, before my current code). That dictionary didn't use regular expressions and had approximately 800,000 keys. The whole process took about a day (including all of the other parsing) but was not as accurate as I would have liked, because it omitted certain name-format permutations. The code below contains my most recent attempt (which is currently running and looks to be among the slowest options yet). It creates subset dictionaries on the fly and still ends up iterating through those smaller dictionaries of regex keys. I've read through and tried to apply solutions from many Stack Overflow questions, but couldn't find a workable one. I'm open to any Python-based idea. The data is real data that I've already cleaned with a prior function.
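For context, here is a minimal sketch of what that earlier concatenated-key lookup looked like; the names, court numbers, years, and key format are made up purely for illustration:

# Minimal, illustrative sketch of the earlier concatenated-key approach:
# plain string keys of the form "<court>|<year>|<NAME FRAGMENT>" and hashed
# dictionary lookups instead of regex matching. All names, courts, and years
# below are placeholders.
lookup = {}
full_name = 'Chagares, Michael A.'
for year in range(2006, 2017):          # hypothetical years of service
    for fragment in ('CHAGARES', 'M A CHAGARES', 'MICHAEL A. CHAGARES'):
        lookup[f'3|{year}|{fragment}'] = full_name   # hypothetical court 3

def match_panel(fragments, court, year):
    # Look each cleaned name fragment up against the prepared dictionary.
    found = [lookup[f'{court}|{year}|{f.strip().upper()}']
             for f in fragments
             if f'{court}|{year}|{f.strip().upper()}' in lookup]
    return sorted(set(found))

print(match_panel(['CHAGARES', 'VANASKIE'], 3, 2014))  # ['Chagares, Michael A.']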

import numpy as np
import pandas as pd

test_data = {'panel_judges' : ['CHAGARES, VANASKIE, SCHWARTZ', 
                               'Sidney R. Thomas, Barry G. Silverman, Raymond C. Fisher, Opinion by Thomas'],
             'court_num' : [3, 9],
             'date_year' : [2014, 2014]}
test_df = pd.DataFrame(data = test_data)

name_dict = {'full_name' : ['Chagares, Michael A.', 
                            'Vanaskie, Thomas Ignatius',
                            'Schwartz, Charles, Jr.',
                            'Schwartz, Edward Joseph',
                            'Schwartz, Milton Lewis',
                            'Schwartz, Murray Merle'],
             'court_num' : [3, 3, 1061, 1097, 1058, 1013],
             'circuit_num' : [3, 3, 5, 9, 9, 3],
             'start_year' : [2006, 2010, 1976, 1968, 1979, 1974],
             'end_year' : [2016, 2019, 2012, 2000, 2005, 2013],
             'hq_match' : [r'M(?:ICHAEL)? ?A?(?:\.)? ?CHAGARES',
                           r'T(?:HOMAS)? ?I?(?:GNATIUS)? ?VANASKIE',
                           r'C(?:HARLES)? SCHWARTZ',
                           r'E(?:DWARD)? ?J?(?:OSEPH)? ?SCHWARTZ',
                           r'M(?:ILTON)? ?L?(?:EWIS)? ?SCHWARTZ',
                           r'M(?:URRAY)? ?M?(?:ERLE)? ?SCHWARTZ'],
             'lq_match' : ['CHAGARES',
                           'VANASKIE',
                           'SCHWARTZ', 
                           'SCHWARTZ', 
                           'SCHWARTZ', 
                           'SCHWARTZ']}
names = pd.DataFrame(data = name_dict)

in_col = 'panel_judges'
year_col = 'date_year'
out_col = 'fixed_panel'
court_num_col = 'court_num'

test_df[out_col] = ''  # holds matched full names as a comma-separated string

def judge_matcher(df, in_col, out_col, year_col, court_num_col, 
                  size_column = None):
    general_cols = ['start_year', 'end_year', 'full_name']
    court_cols = ['court_num', 'circuit_num']
    match_cols = ['hq_match', 'lq_match']
    # Try the high-quality patterns before the low-quality ones, and the
    # specific court number before the broader circuit number.
    for match_col in match_cols:
        for court_col in court_cols:
            lookup_cols = general_cols + [court_col] + [match_col]
            judge_df = names[lookup_cols]
            # Subset both the name table and the case dataframe by every
            # year/court combination so only plausible judges are compared.
            for year in range(df[year_col].min(),
                              df[year_col].max() + 1):
                for court in range(df[court_num_col].min(),
                                   df[court_num_col].max() + 1):
                    lookup_subset = ((judge_df['start_year'] <= year)
                                     & (year < (judge_df['end_year'] + 2))
                                     & (judge_df[court_col] == court))  
                    new_names = names.loc[lookup_subset]
                    df_subset = ((df[year_col] == year) 
                                  & (df[court_num_col] == court))
                    df.loc[df_subset] = matcher(df.loc[df_subset], 
                          in_col, out_col, new_names, match_col)  
    return df   

def matcher(df, in_col, out_col, lookup, keys):
    # Map each regex pattern to the judge's canonical full name.
    patterns = dict(zip(lookup[keys], lookup['full_name']))
    for key, value in patterns.items():
        # Append the full name wherever the pattern appears, then strip the
        # matched text from the source column so it cannot match again.
        df[out_col] = ( 
             np.where(df[in_col].astype(str).str.upper().str.contains(key), 
                                  df[out_col] + value + ', ', df[out_col]))
        df[in_col] = df[in_col].astype(str).str.upper().str.replace(key, '',
                                                                    regex = True)
    return df 

df = judge_matcher(test_df, in_col, out_col, year_col, court_num_col) 

The output I currently get is essentially right (although the names should be sorted and stored in a list). The proper "Schwartz" is picked and the matches are all correct. The problem is speed. My goal is a de-duplicated, alphabetically sorted list of judges on each panel, either stored in a single column or exploded into up to 15 separate columns (I presently do that in a separate vectorized function; a rough sketch of that step follows below). I then do further lookups on those judges based on other demographic and biographical information. The produced data will be openly available to researchers in the area, and the code will be part of a free, publicly available platform usable for studying other courts as well, so accuracy and speed are both important considerations for users on many different machines.
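For illustration, here is a rough sketch of that separate post-processing step; the column and judge names are placeholders:

import pandas as pd

# Hypothetical illustration of the post-processing described above:
# de-duplicate, sort alphabetically, and explode the panel list into columns.
df = pd.DataFrame({'fixed_panel': [['Vanaskie, Thomas Ignatius',
                                    'Chagares, Michael A.',
                                    'Chagares, Michael A.']]})

# De-duplicate and alphabetize each panel's list of judges.
df['fixed_panel'] = df['fixed_panel'].apply(lambda judges: sorted(set(judges)))

# Spread the lists into judge_1 ... judge_n columns (ragged rows fill with None).
judge_cols = pd.DataFrame(df['fixed_panel'].tolist(), index=df.index)
judge_cols.columns = [f'judge_{i + 1}' for i in judge_cols.columns]
df = df.join(judge_cols)
print(df)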

  • Even on a laptop this is just _way_ too much text - it's a solid wall. Whatever the issue is, you need to consolidate it down because people just won't read through that – roganjosh Jan 29 '19 at 22:29
  • Check out [For loops with pandas - When should I care?](https://stackoverflow.com/questions/54028199/for-loops-with-pandas-when-should-i-care). Scroll to the section on string methods. – cs95 Jan 29 '19 at 22:31
  • _an approximately 1 million row pandas dataframe (although any solution needs to scale for larger projects as well)_ - I don't think pandas is the right tool for such _big data games_. I'm not familiar with it but perhaps you should consider having a look at `dask` – SpghttCd Jan 29 '19 at 22:37
  • Thanks. I had trouble distilling the problem further because, if I omit certain aspects of it, suggested solutions might not work. I've been messing around with this problem for weeks and so I wanted to explain all that I've tried. I can try to edit it, but I worry leaving some information out might waste the time of readers who offer inapplicable solutions. – with_precedent Jan 29 '19 at 22:37
  • Thanks - I have read that post and it is very helpful. Because I haven't been able to implement a fully vectorized solution, I haven't been able to see if I can make any significant gains. And my problem falls between a couple categories listed there. – with_precedent Jan 29 '19 at 22:39
  • It's that, or have virtually no readers. The people you want answers from are experienced programmers; they can fill in gaps if the statement is clear "Dataframe might be 100 million rows" etc. – roganjosh Jan 29 '19 at 22:39
  • Regarding dask, I have two worries: 1) I'm not sure if I will get any speed gains because I'm currently fine on memory and can load the whole dataframe in memory; and 2) it would decrease portability and use by others. But I might have missed some of the potential gains of using dask beyond access to datasets that don't fit in memory. – with_precedent Jan 29 '19 at 22:41
  • I've removed a bunch of text, but it is still on the longer side. I will try to cut back a bit more later. Thanks for the suggestion. – with_precedent Jan 29 '19 at 22:50
  • You need to distill this down to a smaller problem/subproblems. This is just too much code to be able to give advice for here. I think there is an interesting problem here, but it's hidden in this wall of code/large example. (I also think this somewhat suffers from the [XY Problem](https://meta.stackexchange.com/a/66378/184179), and you should step back from it to think of alternative solutions.) – Andy Hayden Jan 30 '19 at 00:18
  • You want to leverage str.match or str.extract, rather than using so many for loops... but like I say, it's unclear how to do this with such a verbose example. – Andy Hayden Jan 30 '19 at 00:20
  • Generally, text-processing tasks are not a great use-case for pandas. String manipulation isn't really going to benefit from pandas/numpy. Perhaps don't use pandas for this? – juanpa.arrivillaga Jan 30 '19 at 00:53
  • Oddly, it was because I'd read many questions that failed due to the XY Problem that I wrote such a long question. I wanted to make sure I was clear what my overall goal was so that any suggestions would be well directed. I can solve all of my sub-problems independently, but when I put them together problems emerge. Each sub-solution has tradeoffs: regex vs. trie vs. string, pandas vs. non-pandas, vectorize vs. apply vs. iteration. My hope was to find a suggestion where all of the pieces worked well together. I appreciate the suggestion and will give more thought to breaking it down. – with_precedent Jan 30 '19 at 15:53
  • I use pandas because 1) the data will end up there eventually for machine learning; 2) the data comes from 1 million large text files (approximately 25 GB) and pandas provides a nice container for the key sections I parse out; and 3) it has actually worked really well for all of my other string extractions and matching. np.where and pandas string functions do really well with regex keyword searches. The judge name problem is the one I haven't been able to solve in an efficient way. But I'm totally open to other alternatives. – with_precedent Jan 30 '19 at 15:56

1 Answer


For anyone who stumbles across this question and has a similar complex string matching issue in pandas, this is the solution I found to be the fastest.

It isn't fully vectorized like I wanted, but I used df.apply with this method within a class:

from more_itertools import unique_everseen  # module-level import; ordered de-duplication

def judge_matcher(self, row, in_col, out_col, year_col, court_num_col, 
                  size_col = None):
    final_list = []
    raw_list = row[in_col]
    # Drop empty entries and stray whitespace from the parsed name fragments.
    cleaned_list = [x for x in raw_list if x]
    cleaned_list = [x.strip() for x in cleaned_list]
    for name in cleaned_list:
        # Build lookup keys of decreasing specificity: court + year + name,
        # then year + name, then the bare name.
        name1 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 1)
        name2 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 2)
        if name1 in self.names_dict_list[0]:
            final_list.append(self.names_dict_list[0].get(name1))
        elif name1 in self.names_dict_list[1]:
            final_list.append(self.names_dict_list[1].get(name1))
        elif name2 in self.names_dict_list[2]:
            final_list.append(self.names_dict_list[2].get(name2))
        elif name2 in self.names_dict_list[3]:
            final_list.append(self.names_dict_list[3].get(name2))
        elif name in self.names_dict_list[4]:
            final_list.append(self.names_dict_list[4].get(name)) 
    # De-duplicate (preserving order), then sort alphabetically.
    final_list = list(unique_everseen(final_list))
    final_list.sort()
    row[out_col] = final_list
    if size_col and final_list:
        row[size_col] = len(final_list)
    return row 

@staticmethod
def convert_judge_name(year, court, name, dict_type):
    # Concatenate the court and year (as one integer) or just the year
    # in front of the name to form the dictionary key.
    if dict_type == 1:
        return str(int(court) * 10000 + int(year)) + name
    elif dict_type == 2:
        return str(int(year)) + name
    else:
        return name

Basically, it concatenates values from three columns and performs hashed dictionary lookups (instead of regexes) on the concatenated strings. Multiplying the court number by 10,000 and adding the year places the two numbers side by side in the resulting key string without a separate string concatenation. The dictionaries have keys prepared the same way (with the desired full-name strings as values). Because the matches go into a list that is then de-duplicated, I didn't have to remove the matched strings from the source column. I didn't time this specific function, but the overall module took just over 10 hours to process ~1 million rows. When I run it again, I will try to remember to time this applied function specifically and post the results here. The method is ugly, but fairly effective.
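To make the mechanics concrete, here is a rough sketch of how lookup dictionaries can be prepared to mirror convert_judge_name and how the row-wise method might then be applied; the names, court numbers, years, and variable names are placeholders, not the production code:

# Illustrative only: dictionary keys mirror convert_judge_name
# (type 1: str(court * 10000 + year) + NAME, type 2: str(year) + NAME).
full_name = 'Chagares, Michael A.'
court, start_year, end_year = 3, 2006, 2016   # hypothetical service window

court_year_dict, year_dict = {}, {}
for year in range(start_year, end_year + 1):
    court_year_dict[str(court * 10000 + year) + 'CHAGARES'] = full_name
    year_dict[str(year) + 'CHAGARES'] = full_name

# The real class holds five dictionaries of decreasing strictness; an instance
# would then be applied row-wise along these lines (column names as above):
# df = df.apply(matcher.judge_matcher, axis=1,
#               args=(in_col, out_col, year_col, court_num_col))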