
I am trying to check for fuzzy matches between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10,000 entries.

For example:

df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows

ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) # 10k rows

Output should look like:

df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])

It should generate a match if the word appears as a separate token in the string (and, within that token, up to 1 character substitution is allowed).

For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.

Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' ('MANDER' being 1 character different from 'XANDER'), but does not generate a match against 'ALEXANDERS'.

As of now, we have written the logic for this (shown below), but the match takes over 4 hours to run. I need to get this to under 30 minutes.

Current code:

import regex

tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([rf'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])


def search_it(partyname):
    # Allow up to 1 substitution of an uppercase letter within the match
    m = regex.search("(" + tags_ptn_regex + "){s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)

Also, will multiprocessing help with the regex? Many thanks in advance!

StarMunch
  • What about 'NOVA XNDER'? Is 'XANDER' a match? – Dani Mesejo Dec 14 '20 at 09:58
  • Yes, up to 1 substitution, deletion or addition is allowed. Thanks! – StarMunch Dec 14 '20 at 10:05
  • First of all, you can simplify your regexp: `\s+{tag}\s+|^{tag}\s+|\s+{tag}$` is equivalent to `\b{tag}\b`, where `\b` is a word boundary. 2) Why do you run regex.search a second time on success? Although I think both improvements have a relatively minor effect – Yaroslav Fyodorov Dec 14 '20 at 10:08
  • 2) I don't think your regex works at all - I tried it on 'ALEXANDER' and it matches. So you have a correctness problem first of all. I don't think you can easily do "up to 1 substitution, deletion or addition" with regex. – Yaroslav Fyodorov Dec 14 '20 at 10:17
  • As of now, it's working with up to 1 substitution. Addition/deletion is a nice-to-have. I have edited the script to run regex.search only once, as you said... not much improvement though – StarMunch Dec 14 '20 at 10:19
  • Regarding correctness - I think you need to do it like this: `tags_ptn_regex = '|'.join(tags_regex)` and then `regex.search("\\b(" + tags_ptn_regex + "){s<=1:[A-Z]}\\b", partyname)`, otherwise, as I said, it finds a match on "Alexander" by ignoring `\s+`; besides, it's a bit simpler and perhaps faster. In the Spark world I would suggest splitting your NAMES into words, then exploding, and then matching on single words, but I am not sure whether it will do any good in pandas - your match criteria are complicated after all. – Yaroslav Fyodorov Dec 14 '20 at 10:33
  • @YaroslavFyodorov Thanks for that correction! Unfortunately, this is on a dedicated Py server :( – StarMunch Dec 14 '20 at 10:39
  • On general principles, multithreading seems applicable here, either by splitting your data frame into several chunks and processing them independently, then joining the results, or by splitting your REF_NAMES and running them separately. Not sure about Python specifically. Can you run several Python processes and pass each of them part of your df? (See the sketch after these comments.) – Yaroslav Fyodorov Dec 14 '20 at 10:44
  • I would try to estimate a lower bound for the runtime by running without substitutions (and perhaps with only part of REF_NAMES). If that is above your 30 min limit ... – Yaroslav Fyodorov Dec 14 '20 at 10:47
  • Without the fuzzy bit it's well under 10 mins, but the fuzzy part helps with the accuracy – StarMunch Dec 14 '20 at 10:58
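
To illustrate the chunking idea from the comments above, here is a minimal multiprocessing sketch (untested on the real data), assuming `df`, `tags_ptn_regex` and `search_it` are defined as in the question and a fork-based start method so workers inherit the compiled pattern:

import multiprocessing as mp
import numpy as np

def match_chunk(names):
    # Apply the question's search function to one slice of the NAMES column
    return [search_it(name) for name in names]

if __name__ == '__main__':
    n_workers = mp.cpu_count()
    chunks = np.array_split(df['NAMES'].to_numpy(), n_workers)
    with mp.Pool(n_workers) as pool:
        results = pool.map(match_chunk, chunks)
    df['MATCH'] = [m for chunk in results for m in chunk]

Note that multiprocessing only divides the work across cores; it does not reduce the backtracking cost of the pattern itself.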

1 Answer


Your pattern is rather inefficient, as you repeat each tag pattern three times in the regex. You just need to build a pattern with the so-called whitespace boundaries, (?<!\S) and (?!\S), and then each tag pattern appears only once.
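
For example, here is a minimal illustration of the boundary behavior, using the 'PARIS' tag from the question:

import regex

# (?<!\S) and (?!\S) require the tag to be a whole whitespace-delimited token
pat = regex.compile(r'(?<!\S)PARIS(?!\S)')
print(bool(pat.search('PARIS HILTON')))  # True:  'PARIS' is a separate token
print(bool(pat.search('APARISIAN')))     # False: non-space characters adjoin it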

Next, if you have several thousand alternatives, even the single-tag-pattern regex will be extremely slow, because many alternatives may have to be tried at the same location in the string, and thus there will be too much backtracking.

To reduce this backtracking, you will need a regex trie.
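
For example, with a hypothetical three-word list, the trie factors the shared 'PAR' prefix into a single branch, so the engine never re-scans a common prefix for each alternative:

trie = Trie()  # uses the Trie class defined in the snippet below
for word in ['PARIS', 'PART', 'XANDER']:  # hypothetical example list
    trie.add(word)
print(trie.pattern())  # -> (?:PAR(?:IS|T)|XANDER)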

Here is a working snippet:

import regex
import pandas as pd

## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None

        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None here: this character ends a word
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0

        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')

        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"

        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) # 10k rows

trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

# Compile the trie alternation inside whitespace boundaries, allowing up to
# one substitution of an uppercase letter ({s<=1:[A-Z]}) anywhere in the match
tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None
    
df['MATCH'] = df['NAMES'].apply(search_it)
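
On the sample data above, this should reproduce the expected output from the question, with None in place of NaN for the rows that do not match:

print(df['MATCH'].tolist())
# => [None, 'XANDER', 'MANDER', 'PARIS', 'HARIS', None, 'PARIS', None]
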
Wiktor Stribiżew
  • Impressive. I thought about suggesting building a finite automaton but then decided it's too complicated and wouldn't be efficient anyway. This is a nice way to do it. I wonder how fast it is on this particular data. – Yaroslav Fyodorov Dec 16 '20 at 14:39
  • @YaroslavFyodorov Surely, the fuzzy quantifier slows things down, but the approach itself is proven to work well with up to 50K search phrases of different lengths. – Wiktor Stribiżew Dec 16 '20 at 14:58