Let's say that I have an MDM system (Master Data Management), whose primary application is to detect and prevent duplication of records.

Every time a sales rep enters a new customer in the system, my MDM platform checks the existing records, computes the Levenshtein or Jaccard or XYZ distance between pairs of words, phrases or attributes, applies weights and coefficients, outputs a similarity score, and so on.

Your typical fuzzy matching scenario.
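To make the scenario concrete, here is a minimal sketch of that kind of weighted scoring in plain Python. The field names (`name`, `city`, `email`), the weights, and the choice of which metric goes with which field are all made up for illustration; a real MDM platform will do this differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over lowercase, whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical, hand-configured attribute weights.
WEIGHTS = {"name": 0.5, "city": 0.2, "email": 0.3}


def similarity_score(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-attribute similarities, in [0, 1]."""
    total = 0.0
    for field, weight in weights.items():
        a, b = rec_a[field].lower(), rec_b[field].lower()
        # Assumption: token overlap for names, edit distance for the rest.
        sim = jaccard_similarity(a, b) if field == "name" else levenshtein_similarity(a, b)
        total += weight * sim
    return total / sum(weights.values())
```

A new record would be flagged as a possible duplicate when `similarity_score(new, existing)` exceeds some threshold, which is one more hand-picked parameter.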

I would like to know if it makes sense at all to apply machine learning techniques to optimize the matching output, i.e. find duplicates with maximum accuracy.
And where exactly it makes the most sense.

  • optimizing the weights of the attributes?
  • increasing the algorithm's confidence by predicting the outcome of the match?
  • learning the matching rules that I would otherwise configure into the algorithm?
  • something else?

There's also this excellent answer about the topic but I didn't quite get whether the guy actually made use of ML or not.

Also, my understanding is that weighted fuzzy matching is already a good enough solution, probably even from a financial perspective, since whenever you deploy such an MDM system you have to do some analysis and preprocessing anyway, be it manually encoding the matching rules or training an ML algorithm.

So I'm not sure that the addition of ML would represent a significant value proposition.

Any thoughts are appreciated.

  • My intuition is that the incremental gain you would achieve would not justify the effort. What would be interesting is to use natural language processing/understanding to provide additional context when searching for possible duplicates, but it would be no small project! – ImDarrenG Apr 12 '17 at 10:36
  • If you do pursue this project, one thing to watch will be the essentially binary outcome of your task (match vs no match), combined with a potentially unbalanced dataset (more non-matches than matches). You could end up with a machine that looks very accurate, but is actually just telling you what you already know. – ImDarrenG Apr 12 '17 at 10:41
  • @fgregg: Wondering if you could use [tag:deduplication] instead of the brand-new [tag:record-linkage]. Seems to be the same concept. – Nathan Tuggy May 04 '17 at 03:52
  • @NathanTuggy, it seems to me that most of the questions tagged with deduplication are about removing exact matches. The techniques you use for that are pretty different from the probabilistic approaches associated with record linkage. – fgregg May 04 '17 at 14:04
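
The class-imbalance caveat from the comments can be illustrated with a small sketch. The numbers are invented (1,000 candidate pairs, 10 of which are true duplicates) and the "classifier" simply predicts "no match" for everything; the point is only that accuracy alone can look excellent while no duplicate is ever found.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = match."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Convention used here: report 0.0 when the denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Invented, heavily unbalanced data: 10 duplicates among 1,000 pairs.
y_true = [1] * 10 + [0] * 990
# A useless model that always answers "no match".
y_pred = [0] * 1000

accuracy, precision, recall = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here accuracy is 99% while recall is zero, so precision and recall on the "match" class are the more informative numbers to track for this kind of task.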

2 Answers

The main advantage of using machine learning is the time saving.

It is very likely that, given enough time, you could hand-tune weights and come up with matching rules that are very good for your particular dataset. A machine learning approach could have a hard time outperforming a hand-made system customized for that dataset.

However, building a good matching system by hand will probably take days. If you use an existing ML-based matching tool, like Dedupe, then good weights and rules can be learned in an hour (including setup time).

So, if you have already built a matching system that is performing well on your data, it may not be worth investigating ML. But, if this is a new data project, then it almost certainly will be.
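
As a rough illustration of what "learning the weights" can mean, here is a minimal sketch that fits a logistic regression on per-attribute similarity scores using plain gradient descent. This is not Dedupe's API or implementation; the feature layout (`[name_sim, city_sim, email_sim]`), the labeled pairs, the learning rate and the epoch count are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(x, weights, bias) -> float:
    """Probability that a pair with similarity features x is a duplicate."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit attribute weights and a bias by batch gradient descent on log loss."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            err = match_probability(x, weights, bias) - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical labeled pairs: [name_sim, city_sim, email_sim], 1 = duplicate.
TRAIN_X = [
    [0.95, 1.0, 0.90],
    [0.90, 0.2, 0.95],
    [0.85, 1.0, 0.80],
    [0.30, 1.0, 0.10],
    [0.20, 0.9, 0.15],
    [0.40, 0.1, 0.05],
]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(TRAIN_X, TRAIN_Y)
print("learned weights:", [round(w, 2) for w in weights], "bias:", round(bias, 2))
print("p(duplicate):", round(match_probability([0.9, 0.5, 0.9], weights, bias), 3))
```

The learned coefficients play the role of the hand-tuned attribute weights, and the labeling effort (a handful of match / non-match examples) replaces the manual tuning.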

fgregg

Traditionally, fuzzy record matching software suffers from requiring immense user involvement in project parameterization and clerical review. The user is required either to provide various input parameters and threshold values, or to provide examples of matches and non-matches for machine learning. In both cases, considerable user involvement and expertise is a prerequisite for successful analysis. The main value of using unsupervised machine learning is to let the software figure out the solution automatically, without user involvement. There is at least one such fuzzy matching tool that uses machine learning, called "ReMaDDer": http://remaddersoft.wixsite.com/remadder

zlatko