
I have two lists of over a million names each, with slightly different naming conventions. The goal here is to match the records that are similar, at roughly 95% confidence.

I am aware there are libraries I can leverage, such as the FuzzyWuzzy module in Python.

However, comparing every string in one list against every string in the other seems far too resource-intensive: it would require roughly one million times one million iterations.

Are there any other more efficient methods for this problem?

UPDATE:

So I created a bucketing function and applied simple normalization: removing whitespace and symbols, converting the values to lowercase, etc.

from tqdm import tqdm
from fuzzywuzzy import process

# For each year-month (YM) bucket, find the closest match for every name within that bucket
for n in dftest['YM'].unique():
    n = str(n)
    frame = dftest['Name'][dftest['YM'] == n]
    print(len(frame))
    print(n)
    for names in tqdm(frame):
        closest = process.extractOne(names, frame)

Using Python's pandas, the data is loaded into smaller buckets grouped by year, and then process.extractOne from the FuzzyWuzzy module is used to get the best match.

Results are still somewhat disappointing. In testing, the code above is run on a test data frame containing only 5,000 names and takes almost a whole hour.

The test data is split up by:

  • Name
  • Year Month of Date of Birth

and I am only comparing names whose YMs fall in the same bucket.

Could the problem be the FuzzyWuzzy module I am using? I would appreciate any help.

BernardL
  • You can try to normalize names and compare normalized forms. That becomes quite parallelizable. – Alec Aug 16 '16 at 07:54
  • How would you recommend on going about normalization? Is there a standard method I can apply? Appreciate your feedback. – BernardL Aug 16 '16 at 07:55
  • Well, it depends on the sort of naming differences. As a simple example, matching company names, I might strip away phrases like `LTD` or `INC` and maybe even non-letters. – Alec Aug 16 '16 at 07:57
  • That is the difficulty with names, I guess; we can normalize, but this will not help reduce the number of iterations. – BernardL Aug 16 '16 at 07:59
  • Well no, it will. You normalize all records, then consider records to be duplicates of each other if their normalized forms are equal. This is just one big map reduce. – Alec Aug 16 '16 at 08:03
  • You can try `unique`ing both lists, to hopefully reduce both by some orders of magnitude, so that the quadratic (fuzzy) comparisons won't be as painful. – acdr Aug 16 '16 at 08:04
  • Can you post some data from your list? That would be helpful in reaching a solution. What are the slightly different naming conventions, for example? What two names are considered a 95% match? – DhruvPathak Aug 16 '16 at 09:34
  • If you don't mind a `O(mn)` run time, you could compute the Levenshtein distance between the two strings and see if they're within your threshold of acceptance – Nick Zuber Aug 16 '16 at 11:43
  • @NickZuber: O(mn) when m and n are > 10^6 is on the order of a trillion. And considering that Levenshtein distance is O(xy), where x and y are the lengths of the individual strings, I suspect that solution would take many days to complete. Even if you were willing to wait days, I doubt that Levenshtein distance would produce useful results. – Jim Mischel Aug 16 '16 at 15:25
  • You need to give us more information about your data. Are you trying to do a one-to-one mapping between the two lists? Or are there multiple items in one list that can be considered the same? What is the nature of the differences? Does one have abbreviations? Underscores or other word separators? Are you trying to correct misspellings? The more we know about the nature of your data, the better we can provide suggestions. – Jim Mischel Aug 16 '16 at 15:28
  • @JimMischel n and m are representative of the respective string lengths, sorry if that wasn't clear. And the strings aren't millions of characters long.. You can compute millions of strings with the Levenshtein algorithm in seconds – Nick Zuber Aug 16 '16 at 16:02
  • @NickZuber: But you have to make on the order of 10^12 Levenshtein calculations to compare every string in list A with every string in list B. Even if you could make a million a second, that's still about 12 days. – Jim Mischel Aug 16 '16 at 16:27
  • @JimMischel oh yeah, doing _trillions_ of words might take a while :) – Nick Zuber Aug 16 '16 at 17:01
  • Hi all, I am trying to do a one-to-one map. I am alright with simple string-match logic, where comparing 'abcd' and 'abcf' returns 75% confidence. I updated my question with some trials. Still facing processing time issues. – BernardL Aug 21 '16 at 07:13
  • @BernardL Can you please share your full code, along with a basic profiling using time.time() to show how much time each segment is taking ? – DhruvPathak Sep 19 '16 at 10:28
  • @BernardL Have also added a section "Beginning Match" to optimize this more. – DhruvPathak Sep 19 '16 at 10:33

3 Answers


There are several levels of optimization possible here to bring this problem down from O(n^2) to a lesser time complexity.

  • Preprocessing: Sort your list in a first pass, creating an output map for each string; the key for the map can be the normalized string. Normalization may include:

    • lowercase conversion,
    • removal of whitespace and special characters,
    • transformation of Unicode to ASCII equivalents where possible (use unicodedata.normalize or the unidecode module).

    This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similarly grouped names.

You can use this utility method to normalize your strings (it does not include the Unicode handling, though):

import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):
    """ Processes string for similarity comparisons: cleans special characters and extra whitespace
        if normalized is True, and removes the substrings which are in ignore_list
    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True, method removes special characters and extra whitespace from string,
                            and converts it to lowercase
        ignore_list (list) : the substrings which need to be removed from the input string
    Returns:
       str : returns processed string
    """
    for ignore_str in ignore_list:
        input_str = re.sub(r'{0}'.format(ignore_str), "", input_str, flags=re.IGNORECASE)

    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and extra whitespace
        input_str = re.sub(r"\W", "", input_str).strip()

    return input_str
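
For example, a quick check of the helper above (the outputs follow from the regex-based cleanup; the accented variant "ándréw h. smith" would still need the unidecode step mentioned earlier):

>>> process_str_for_similarity_cmp("Andrew H Smith", normalized=True)
'andrewhsmith'
>>> process_str_for_similarity_cmp("andrew h. smith", normalized=True)
'andrewhsmith'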
  • Now similar strings will already lie in the same bucket if their normalized keys are the same.

  • For further comparison, you will need to compare only the keys, not the original names, e.g. andrewhsmith and andrewhsmeeth, since this kind of name similarity requires fuzzy string matching beyond the exact normalized comparison done above.

  • Bucketing: Do you really need to compare a 5-character key with a 9-character key to see whether that is a 95% match? No, you do not. So you can create buckets for matching your strings: e.g. 5-character keys are matched only against 4-6 character keys, 6-character keys against 5-7 character keys, etc. An n-1 to n+1 character range for an n-character key is a reasonably good bucket for most practical matching.

  • Beginning match: Most variations of names will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can run matching only between keys starting with a and other keys which start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match a key andrewhsmith against bndrewhsmith, as such a name variation in the first letter will rarely exist. (See the sketch after this list for one way to build these buckets.)
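
As a rough sketch of the bucketing and beginning-match ideas above (the build_buckets/candidate_keys helper names and the exact grouping are illustrative choices, not part of the answer's code):

from collections import defaultdict

def build_buckets(normalized_keys):
    """Group normalized keys by (first character, key length)."""
    buckets = defaultdict(list)
    for key in normalized_keys:
        if key:
            buckets[(key[0], len(key))].append(key)
    return buckets

def candidate_keys(key, buckets):
    """Return only the keys worth fuzzy-comparing against: same first
    character, length within n-1 to n+1 of the query key."""
    candidates = []
    for length in (len(key) - 1, len(key), len(key) + 1):
        candidates.extend(buckets.get((key[0], length), []))
    return candidates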

Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage; you may exclude one of jaro_winkler or difflib to optimize for speed and result quality:

import difflib
import jellyfish

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
    """ Calculates matching ratio between two strings
    Args:
        first_str (str) : First String
        second_str (str) : Second String
        normalized (bool) : if True, method removes special characters and extra whitespace
                            from strings, then calculates matching ratio
        ignore_list (list) : list of substrings which have to be substituted with "" in the strings
    Returns:
       Float Value : Returns a matching ratio between 1.0 (most matching) and 0.0 (not matching)
                    using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                    equal weightage to each
    Examples:
        >>> find_string_similarity("hello world","Hello,World!",normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship","entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])
        1.0
    """
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    # on Python 2, wrap the arguments to jaro_winkler in unicode()
    match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio() + jellyfish.jaro_winkler(first_str, second_str)) / 2.0
    return match_ratio
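
For illustration, the pieces could be wired together roughly like this; names_list_a, names_list_b, the 0.95 cutoff, and the build_buckets/candidate_keys helpers from the earlier sketch are assumptions, not part of the original answer:

# Hypothetical end-to-end flow: normalize, bucket, then fuzzy-compare only within buckets
keys_a = {process_str_for_similarity_cmp(name, normalized=True): name for name in names_list_a}
buckets = build_buckets(keys_a)

matches = []
for name_b in names_list_b:
    key_b = process_str_for_similarity_cmp(name_b, normalized=True)
    if not key_b:
        continue
    for key_a in candidate_keys(key_b, buckets):
        if find_string_similarity(key_a, key_b) >= 0.95:
            matches.append((keys_a[key_a], name_b))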
DhruvPathak
  • Hey there, I tried your method with bucketing and normalizing, which made a lot of sense; however, I am still stuck on processing time, which takes far too long: almost an hour to process 5,000 names. Currently using the FuzzyWuzzy module, with the process.extractOne function. Any help is much appreciated. – BernardL Aug 21 '16 at 07:16

You have to index or normalize the strings to avoid the O(n^2) run time. Basically, you have to map each string to a normal form and build a reverse dictionary with all the words linked to their corresponding normal forms.

Let's assume that the normal forms of 'world' and 'word' are the same. So, first build a reversed dictionary of Normalized -> [word1, word2, word3], e.g. turn:

"world" <-> Normalized('world')
"word"  <-> Normalized('word')

into:

Normalized('world') -> ["world", "word"]

There you go: all the entries (lists) in the Normalized dict which have more than one value are the matched words.

The normalization algorithm depends on the data, i.e. the words. Consider one of the many:

  • Soundex
  • Metaphone
  • Double Metaphone
  • NYSIIS
  • Caverphone
  • Cologne Phonetic
  • MRA codex
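
A minimal sketch of this idea, assuming the jellyfish library for the phonetic encoding and Soundex as the chosen normal form (any of the algorithms above could be substituted):

import jellyfish
from collections import defaultdict

names = ["Smith John", "Smyth Jon", "Schmidt Johann"]  # placeholder data

# Reverse dictionary: normal form -> all names that share it
normalized = defaultdict(list)
for name in names:
    key = " ".join(jellyfish.soundex(part) for part in name.split())
    normalized[key].append(name)

# Entries with more than one name are the candidate match groups
matches = [group for group in normalized.values() if len(group) > 1]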
Zaur Nasibov
  • Understood, I will post an update if I get the solution working. Indexing based on other information and normalizing seems to be a good approach. – BernardL Aug 16 '16 at 08:36

Specific to fuzzywuzzy, note that currently process.extractOne defaults to WRatio, which is by far the slowest of its algorithms, and the processor defaults to utils.full_process. If you pass in, say, fuzz.QRatio as your scorer it will go much quicker, though it is not as powerful, depending on what you're trying to match. It may be just fine for names, though. I personally have good luck with token_set_ratio, which is at least somewhat quicker than WRatio. You can also run utils.full_process() on all your choices beforehand and then run with fuzz.ratio as your scorer and processor=None to skip the processing step (see below). If you're just using the basic ratio function, fuzzywuzzy is probably overkill, though. FWIW I have a JavaScript port (fuzzball.js) where you can pre-calculate the token sets too and use those instead of recalculating each time.

This doesn't cut down the sheer number of comparisons, but it helps. (Perhaps a BK-tree for that? I have been looking into the same situation myself.)

Also be sure to have python-Levenshtein installed so you use the faster calculation.

**The behavior below may change, open issues under discussion etc.**

fuzz.ratio doesn't run full_process, and the token_set and token_sort functions accept a full_process=False param, but if you don't set processor=None the extract function will try to run full_process anyway. You can use functools.partial to, say, pass in fuzz.token_set_ratio with full_process=False as your scorer, and run utils.full_process on your choices beforehand.
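
A small sketch of that setup, using fuzzywuzzy's process/fuzz/utils APIs; query and raw_choices are placeholders:

from functools import partial
from fuzzywuzzy import fuzz, process, utils

raw_choices = ["Andrew H. Smith", "Andrew H Smeeth", "John Doe"]  # placeholder data
query = "andrew h smith"

# Pre-process the choices once instead of on every comparison
choices = [utils.full_process(c) for c in raw_choices]

# Scorer that skips its own internal full_process step
scorer = partial(fuzz.token_set_ratio, full_process=False)

best = process.extractOne(utils.full_process(query), choices,
                          scorer=scorer, processor=None)
print(best)  # e.g. ('andrew h smith', 100)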

nbkap