
I have 2,500,000 product names and I want to try to group them together, i.e. find products that have similar names. For example, I could have three products:

  • Heinz Baked Beans 400g;
  • Hz Bkd Beans 400g;
  • Heinz Beans 400g.

that are actually the same product and can be merged together.

My plan was to use an implementation of Jaro–Winkler distance to find matches. The process works as follows:

  • make a big list of all the product names in memory;
  • pick the first product on the list;
  • compare it to every product that comes after it in the list and calculate a "Jaro Score";
  • any products that have a high match (say 0.95f or higher) are reported;
  • move to the next product.

So this has some optimisation in that it only matches each product one way around, saving half the processing time.

I coded this up and tested it. It worked perfectly and found dozens of matches to investigate.

It takes roughly 20 seconds to compare 1 product to 2,500,000 other products and calculate the "Jaro Score". Assuming my calculations are correct this means it will take the best part of one year to complete the processing.
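
(As a rough check on that figure: 2,500,000 products × 20 seconds each, halved because each pair is only compared once, comes to about 25,000,000 seconds, or roughly 290 days of continuous processing.)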

Obviously this isn't practical.

I have had colleagues go over the code and they managed to make a 20% improvement to the speed of the Jaro Score calculation part. They made the process multithreaded and that has made it a little bit faster. We also removed some of the pieces of information being stored, reducing it to just a list of product names and unique identifiers; this didn't seem to make any difference to the processing time.

With these improvements we still think this will take a number of months to process and we need it to take hours (or a few days at most).

I don't want to go into too much detail as I don't think this is entirely relevant, but I load the product details into a list:

private class Product
{
    public int MemberId;
    public string MemberName;
    public int ProductId;
    public string ProductCode;
    public string ProductName;
}
private class ProductList : List<Product> { }
private readonly ProductList _pl = new ProductList();

Then I use the following to process each product:

//Outer loop over each product in the list (abridged)...
var match = _pl[matchCount];

for (int count = 1; count < _pl.Count; count++)
{
    var search = _pl[count];
    //Don't match products with themselves (redundant in a one-tailed match)
    if (search.MemberId == match.MemberId && search.ProductId == match.ProductId)
        continue;
    float jaro = Jaro.GetJaro(search.ProductName, match.ProductName);

    //We only log matches that pass the criteria
    if (jaro > target)
    {
        //Load the details into the grid
        var row = new string[7];
        row[0] = search.MemberName;
        row[1] = search.ProductCode;
        row[2] = search.ProductName;
        row[3] = match.MemberName;
        row[4] = match.ProductCode;
        row[5] = match.ProductName;
        row[6] = (jaro*100).ToString("#,##0.0000");
        JaroGrid.Rows.Add(row);
    }
}

I think for the purposes of this question we can assume that the Jaro.GetJaro method is a "black box", i.e. it doesn't really matter how it works as this part of the code has been optimised as much as possible and I can't think how it could be improved.

Any ideas for a better way to fuzzy match this list of products?

I am wondering if there is a "clever" way to pre-process the list to get most matches out at the start of the matching process. For example, if it takes 3 months to compare all products but only 3 days to compare "likely" products then we could live with this.

Okay, two common things are coming up. Firstly, yes, I do take advantage of a single tailed matching process. The real code for this is:

for (int count = matchCount + 1; count < _pl.Count; count++)

I regret posting the amended version; I was trying to simplify this a little (bad idea).

Secondly, a lot of people want to see the Jaro code, so here goes (it is rather long and it wasn't originally mine - I might have even found it on here somewhere?). By the way I love the idea of exiting before completion as soon as a bad match is indicated. I will start looking at that now!

using System;
using System.Text;

namespace EPICFuzzyMatching
{
    public static class Jaro
    {
        private static string CleanString(string clean)
        {
            clean = clean.ToUpper();
            return clean;
        }

        //Gets the similarity of the two strings using Jaro distance
        //param string1 the first input string
        //param string2 the second input string
        //return a value between 0-1 of the similarity
        public static float GetJaro(String string1, String string2)
        {
            //Clean the strings, we do some tricks here to help matching
            string1 = CleanString(string1);
            string2 = CleanString(string2);

            //Get half the length of the shorter string, rounded up (this is the maximum distance within which characters are considered matching)
            int halflen = ((Math.Min(string1.Length, string2.Length)) / 2) + ((Math.Min(string1.Length, string2.Length)) % 2);

            //Get common characters
            String common1 = GetCommonCharacters(string1, string2, halflen);
            String common2 = GetCommonCharacters(string2, string1, halflen);

            //Check for zero in common
            if (common1.Length == 0 || common2.Length == 0)
                return 0.0f;

            //Check the common strings are the same length, returning 0.0f if not
            if (common1.Length != common2.Length)
                return 0.0f;

            //Get the number of transpositions
            int transpositions = 0;
            int n = common1.Length;
            for (int i = 0; i < n; i++)
            {
                if (common1[i] != common2[i])
                    transpositions++;
            }
            transpositions /= 2;

            //Calculate jaro metric
            return (common1.Length / ((float)string1.Length) + common2.Length / ((float)string2.Length) + (common1.Length - transpositions) / ((float)common1.Length)) / 3.0f;
        }

        //Returns a string buffer of characters from string1 that are found in string2, if they are
        //within a given distance of separation from their position in string1.
        //param string1
        //param string2
        //param distanceSep
        //return a string buffer of characters from string1 found in string2 within the given
        //distance of separation from their position in string1
        private static String GetCommonCharacters(String string1, String string2, int distanceSep)
        {
            //Create a return buffer of characters
            var returnCommons = new StringBuilder(string1.Length);

            //Create a copy of string2 for processing
            var copy = new StringBuilder(string2);

            //Iterate over string1
            int n = string1.Length;
            int m = string2.Length;
            for (int i = 0; i < n; i++)
            {
                char ch = string1[i];

                //Set boolean for quick loop exit if found
                bool foundIt = false;

                //Compare char with range of characters to either side
                for (int j = Math.Max(0, i - distanceSep); !foundIt && j < Math.Min(i + distanceSep, m); j++)
                {
                    //Check if found
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        //Append character found
                        returnCommons.Append(ch);
                        //Alter copied string2 for processing
                        copy[j] = (char)0;
                    }
                }
            }
            return returnCommons.ToString();
        }
    }
}
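
On the early-exit idea: one cheap pre-check falls straight out of the formula at the end of GetJaro. Even with zero transpositions and every character of the shorter name matched, the score cannot exceed (2 + shorter/longer) / 3, so a pair of names with very different lengths can be skipped without calling GetJaro at all. A minimal, untested sketch (the method name is my own):

private static bool CouldPossiblyReach(string name1, string name2, float target)
{
    //Upper bound on the Jaro score given only the two lengths: assume every
    //character of the shorter string matches and there are no transpositions
    float shorter = Math.Min(name1.Length, name2.Length);
    float longer = Math.Max(name1.Length, name2.Length);
    if (longer == 0)
        return false;
    return (2f + shorter / longer) / 3f >= target;
}

Calling this at the top of the inner loop (and skipping the pair when it returns false) costs almost nothing; with a target of 0.95 it rejects any pair where the shorter name is less than 85% of the length of the longer one.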

Seeing as this question is still getting some views, I thought I would give a quick update on what happened:

  • I really wished I had originally posted the actual code I was using, as people are still telling me to halve my iterations (obviously not reading beyond the first paragraph or so);
  • I took some of the suggestions made here, and some that other people outside SO came up with, and got the run time down to around 70 hours;
  • the main improvement was preprocessing the data, to only consider items that had a reasonably high number of sales attached to them. Not great, but it made the work load substantially smaller;
  • I had an issue with my laptop overheating, and so I ran most of the job with the laptop in my fridge over a weekend. In doing this I learned that fridges are NOT a good environment for a laptop (too damp), and my laptop died about a week later;
  • the net result was that I achieved what I set out to do, maybe not as comprehensively as I had hoped, but overall I would judge it as being a success;
  • why didn't I accept an answer? Well really none of the answers below entirely solved my initial problem, and although they mostly helped (some answers that came in years after I first posted this question certainly didn't help), I felt it was unfair to pick one as "THE ANSWER".
Richard Hansell
  • I think you will need to add several heuristics. You will know better what typical matches look like, but based on your example one might be to group on the first letter of each product, then only compare within the same letter group. That way each product only needs to be compared to 1/26th of the total number of items (assuming a uniform distribution) – mao47 Sep 20 '13 at 12:17
  • Could you get a modified GetJaro method to return early when it becomes clear that the score is going to be too low? – Rob Sep 20 '13 at 12:21
  • What's with the `search.MemberId == match.MemberId && search.ProductId == match.ProductId` line? Do you have multiple entries in your master list that have the same `MemberId` and `ProductId` that are _different_ object instances and wouldn't satisfy a reference check (`Object.ReferenceEquals(match, search)`)? – Chris Sinclair Sep 20 '13 at 12:41
  • From the `GetJaro` code, I can already see that you should be "pre-cleaning" your `ProductName` rather than doing that _every single time_. Do a single pass through your 2,500,000 items (or as you're building your list, do a `ToUpper` on their names, or store a `CleanedProductName`). This has GOT to be a big hit. EDIT: I mean, even minimizing the number of loops needed, that's got to be like `6,250,002,500,000` `ToUpper` calls along with all the string character iteration, string creation, and garbage collection. – Chris Sinclair Sep 20 '13 at 12:43
  • A small optimization, but you can do the Math.Min once and use the temporary result in `int halflen = ...` line. – mao47 Sep 20 '13 at 12:46
  • Can you `Normalize` your input data before doing any comparisons? Doing that is `O(n)` rather than `O(n^2)`. You may find that doing this makes some items string.Equal, which is quicker to check. I've no idea what Normal Form would look like in your domain, but it might involve using `ToUpper()`, correcting typos & substituting abbreviations, removing "and", etc. – Rob Sep 20 '13 at 12:47
  • It's hard to know what's faster than what else, or what the compiler does. Might a `(int)Math.Ceil` be faster than `+ ... % 2`? I really don't know, just something to try. – Tesserex Sep 20 '13 at 12:50
  • Another minor one, do the `if (common1.Length == 0) return 0.0f` check _before_ even calculating `common2`. No sense calculating it if you're going to throw it away anyway. Furthermore, it saves you a `0` check on `common2.Length` (you can then just check `common1.Length != common2.Length`). Also, I'm not _positive_, but returning a true/false value _might_ be slightly faster than returning a `float` and checking it outside against your ranked value. (Not feeling confident about that bit _at all_) – Chris Sinclair Sep 20 '13 at 12:56
  • Also, is this comparison something that you will only ever run once? If not, then there's probably a huge scope for saving time on subsequent runs over related data. A previous run on a food product list won't help you much when analyzing an address list, but might help you run on someone else's food product list. – Rob Sep 20 '13 at 12:56
  • @mao47 - I make no assumptions about the data being matched but I could certainly learn from running through the first hundred or so matches and use this to fine tune the process. Using the first letter of the product doesn't always work, e.g. "Baked Beans (Heinz)" would be missed out from my first example – Richard Hansell Sep 20 '13 at 13:12
  • @Chris - the test I put in is to prevent a match being made between a product and itself; this code can be removed as this will never happen due to the "start with the next product" optimisation. – Richard Hansell Sep 20 '13 at 13:13
  • @Chris - the ToUpper optimisation is in place now, as is the early exit before calculating common2 (missed that one). I will look at your other suggestions and give some feedback. – Richard Hansell Sep 20 '13 at 13:14
  • @Rob, this is likely to be a one-off process. Ideally the process for creating a new product would include a test to see if a product with a very similar name already exists, but we are loading thousands of products into the system at a time. – Richard Hansell Sep 20 '13 at 13:16
  • Also - I guess (hope) this Jaro test is an `equivalence relation`. A small optimization: If you actually find a match `x~=y`, remove `y` from your list of items to check. Anything that `y` is equivalent to should also be equivalent to `x`. – Rob Sep 20 '13 at 13:17
  • @RichardHansell - I would definitely run on a subset of data to see the types of matches you have. I doubt the optimizations presented so far will get you to an acceptable level of performance. Sub-dividing the problem into smaller subsets could offer a pretty big savings. You may have to live with a small number of matches slipping through, but you'll have to design a heuristic based on your data to minimize that. -- I assume 400g and 800g products are different? This might be a better heuristic than my original first letter suggestion. Just some ideas to jump off of. – mao47 Sep 20 '13 at 13:20
  • Also, silly question maybe, but you are running this with `release` mode optimizations turned on, right? – Chris Sinclair Sep 20 '13 at 14:12
  • To give some more feedback, here are some of the things we tried and the performance increases that resulted. Pre-processing the product names gave a 10-15% increase, and ditching the common2 calculation gave a 1% improvement. The biggest improvement so far was to build a release version of the code rather than running it in Visual Studio - that gave a 50% increase in performance!! – Richard Hansell Sep 20 '13 at 14:16
  • According to my calculations you're already aware, but for anyone else facing a similar situation, keep in mind that by only doing each comparison once, that is, starting the inner loop at `matched + 1`, you will save _about_ half the time, as you go from doing n^2 operations to doing (n)(n-1)/2 operations. – mao47 Sep 20 '13 at 17:10
  • ...Meaning that the total time is ~ (2500000 products * 20 seconds per product * 1/2), which is nearly 300 days as opposed to 500-some days. – mao47 Sep 20 '13 at 17:23
  • After doing some benchmarking over the weekend the latest estimate is roughly 6,000 days to process the entire list, even with the improvements made already. However, I came up with a better system that works very nicely. I picked out all the words from the names, ignoring punctuation and short words, etc. From this I generated a "top 100 words" list and used it to pre-filter the product names into (for example) products with "chicken" in the name, products with "tomato" in the name, etc. These smaller datasets process in around 5-10 minutes :D – Richard Hansell Sep 23 '13 at 15:02
  • Possible duplicate of [Fuzzy matching deduplication in less than exponential time?](http://stackoverflow.com/questions/7196053/fuzzy-matching-deduplication-in-less-than-exponential-time) – fgregg Sep 17 '16 at 16:02

3 Answers


For a start it looks like "outer loop" is also looping over _pl, since you have a matchCount and then take a match out of it.

If I am correct in that, then your loop counter count should start at matchCount + 1, so that you don't test a vs b and then later test b vs a again. It would also eliminate the need to check an item against itself at the top of the loop, and cut your number of iterations roughly in half.

edit, another idea

Some people have said you should pre-process your match string so you aren't repeating operations like the ToUpper on it. If you do that, here's something else (possibly small) you can do.

Search your match string for double letters. If you find any, remove them from your match string, but mark that you did so (store a list of indexes where letters were doubled). Inside GetCommonCharacters, just add 1 to the loop end condition when comparing with the single remaining instance of that letter. Subsequent comparisons need to adjust for the missing letter as well. Specifically, make your loop go from i - distanceSep to i + distanceSep + 1 (keep the min and max checks of course).

Let's say your string1 contains "ee", with distanceSep of 1. Instead of 6 comparisons, you have 4, 33% savings. The savings is greater with higher distanceSep. If it were 2, you would drop from 10 to 6, 40% savings.

An example if that was confusing. string1 has "ee", string2 just has "abcd" so it won't match. distanceSep is 1. Instead of having to compare "e/a", "e/b", "e/c" ... and then "e/b", "e/c", "e/d", kill the second 'e' in string1 and only compare that e with all four letters.

Tesserex
  • Yes, this is covered in the question where I say: "compare it to every product that comes after it in the list and calculate a "Jaro Score";". This does indeed reduce the processing by one half. – Richard Hansell Sep 20 '13 at 12:35
  • @RichardHansell: It doesn't look like that though from the code that you've posted. – Chris Sinclair Sep 20 '13 at 12:39
  • @RichardHansell, your posted code does not use this optimization. If you change the inner loop to `for(int count = matchCount + 1; ...`, you won't double check items and you can also get rid of the `if` statement to check for the same item. – mao47 Sep 20 '13 at 12:40
  • Yes, apologies for that, I was trying to "simplify" the code but that turned out to be a very bad idea. I have updated the question to make this clearer. – Richard Hansell Sep 20 '13 at 12:47

IMHO you should definitely post the GetJaro() code, as it is the part of your program that takes the time.

It involves string operations and is executed millions of times, so even improvements that only shave off a fraction of the calculation time will do more for the overall run time than removing a microsecond from the list handling.

tl;dr: optimize the code that takes time, not the loop around it.

Edit: I have to put this into the answer. Do not use floats; use integer types instead. They are a lot faster as they do not need the FPU. Also, I do suggest normalizing the input, as in ToUpper() or something, to make all items "similar" in their appearance.
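
A minimal sketch of the normalisation idea, reusing the CleanedProductName field name suggested in the comments above (the field would need adding to the Product class, and GetJaro's CleanString step could then be dropped):

//Normalise each name once at load time instead of inside every GetJaro call
foreach (var product in _pl)
{
    product.CleanedProductName = product.ProductName.ToUpper();
}

The inner loop would then pass CleanedProductName rather than ProductName to GetJaro.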

damaltor
  • also: do not use float variables and the generic "var" type, as integers and strongly typed variables are easier to handle (and thus faster). – damaltor Sep 20 '13 at 12:30
  • I believe `var` is equivalent to the corresponding type as it is resolved at compile time rather than at execution time. See: http://stackoverflow.com/questions/356846/c-sharp-var-vs-specific-type-performance – mao47 Sep 20 '13 at 12:32
  • I am not sure about that, as it would be theoretically possible to use the var container for another type later on. You might be right though. – damaltor Sep 20 '13 at 12:35
  • @damaltor: `var` will only ever represent the type resulting from the expression assigned to that variable. For example, `var i = 0;` will always be an `int`, _always_. `var d = 0.0;` will always be a `double`. They are _equivalent_ at compile-time to `int i = 0;` and `double d = 0.0;` – Chris Sinclair Sep 20 '13 at 12:37

The fundamental problem is that you are comparing every pair of records. That means the number of comparisons that you have to make is 0.5 * N * (N-1), or O(N^2).

You need to find a way to reduce this. There are a few ways to do this, but the simplest is called "blocking." Basically, you break your data into groups of records that already have something in common, like a shared word or the same first three characters. Then you only compare records within a block.

In the worst case the complexity is still O(N^2); in the best case it is O(N). Neither the worst nor the best case will be seen in practice. Typically, blocking can reduce the number of comparisons by over 99.9%.

The dedupe Python library implements a number of blocking techniques, and its documentation gives a good overview of the general approach.
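
A rough sketch of what blocking could look like against the code in the question, reusing _pl, target and Jaro.GetJaro (this needs System.Linq; the first three characters of the upper-cased name are just one possible key, and for product names a shared word may work better):

//Group products by a cheap blocking key, then only compare pairs within each block
var blocks = _pl.GroupBy(p => p.ProductName.Length >= 3
    ? p.ProductName.Substring(0, 3).ToUpper()
    : p.ProductName.ToUpper());

foreach (var block in blocks)
{
    var items = block.ToList();
    for (int i = 0; i < items.Count; i++)
    {
        for (int j = i + 1; j < items.Count; j++)
        {
            float jaro = Jaro.GetJaro(items[i].ProductName, items[j].ProductName);
            if (jaro > target)
            {
                //Record the candidate pair, e.g. add a row to the grid as before
            }
        }
    }
}

To give a feel for the saving: 2,500,000 names compared pairwise is about 3.1 × 10^12 comparisons, but split into 1,000 blocks of roughly 2,500 names each it is about 3.1 × 10^9, a reduction of around 99.9%.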

fgregg