I have been trying for a while to build a working gazetteer/dedupe example that scales to semi-large datasets and reads from SQL (based on the examples provided by the package), without success. I would really appreciate it if anyone could help or share a working sample.

Things I have tried so far:

  • I tried the SQL example. I had to split some of the SQL statements into separate CREATE and INSERT statements to comply with GTID requirements, but everything else follows the example. It appears to run successfully up to the clustering step, which then fails with the following error:
    "dedupe.core.BlockingError: No records have been blocked together. Is the data you are trying to match like the data you trained on?" Nothing I tried fixed this. (I am training and testing on exactly the same data, so this error does not make sense to me.)

  • For large-scale gazetteer matching I started from this example, but I get this error: "TypeError: train() takes at most 3 arguments (4 given)". The only change I made is that I connect to a MySQL database. I also cannot find any guidance on how to scale all parts of gazetteer matching (or I just do not understand how this example helps with that).
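
For readers unfamiliar with blocking, here is a minimal sketch of the situation the BlockingError message describes: no two records end up sharing a blocking key, so there are no candidate pairs to cluster. This is not dedupe's implementation. The predicate `first_three_chars`, the helper `block_records`, and the sample records are all hypothetical, and the empty-field case at the end is only one assumed way such a result could arise (for example, if the SQL query returned NULLs for a trained field).

```python
from collections import defaultdict


def first_three_chars(value):
    """Hypothetical blocking predicate: key on the first three characters."""
    return value.strip().lower()[:3]


def block_records(records, field, predicate):
    """Group record ids by the blocking key computed from one field."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        value = record.get(field)
        if value:  # a missing or empty field produces no blocking key
            blocks[predicate(value)].append(record_id)
    # Only blocks holding at least two records yield candidate pairs.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


records = {
    1: {"name": "Acme Corp"},
    2: {"name": "ACME Corporation"},
    3: {"name": "Globex"},
}
blocks = block_records(records, "name", first_three_chars)

# Assumed failure case: the field arrives empty, so nothing is blocked together.
empty_records = {1: {"name": None}, 2: {"name": ""}}
no_blocks = block_records(empty_records, "name", first_three_chars)

print(blocks)
print(no_blocks)
```

Printing a few rows exactly as they are passed to the blocking step, and comparing their field names and values with the training data, is one way to check whether something like the empty-field case applies.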

Has anyone actually been able to scale these to large data using MySQL?

Please let me know if I need to provide more info or code snippets.

Thanks in advance.

mersa
  • The comment says "unduplicated". Why dedupe? – Rick James May 01 '18 at 23:10
  • I think that is a typo. – mersa May 02 '18 at 15:30
  • Can you use `SELECT DISTINCT` instead of using python code? – Rick James May 02 '18 at 16:44
  • For deduping? That would be no good. There are many cases where two rows describe the same entity but have minor typos. – mersa May 02 '18 at 20:36
  • I have actually managed to resolve most issues with the examples. The only remaining issue is here : https://stackoverflow.com/questions/50051487/values-are-not-inserted-into-mysql-table-using-pool-apply-async-in-python2-7 – mersa May 02 '18 at 20:40
  • Hello @mersa, please can you share how you achieved the gazetteer scaling with MySQL? I have been searching desperately for something like this. Please! – Sleek Aug 24 '18 at 17:05
  • @mersa kindly check this thread if there is any help you can offer: https://github.com/dedupeio/dedupe/issues/691 – Sleek Aug 24 '18 at 17:39
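
To illustrate the point raised in the comments about `SELECT DISTINCT`, here is a minimal sketch of why exact distinctness does not catch rows that differ by a typo. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, not dedupe's learned model. The sample rows and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical rows: the first two describe the same person with a typo.
rows = [
    "Jonathan Smith, 12 Main St",
    "Jonathon Smith, 12 Main St",
    "Maria Lopez, 4 Oak Ave",
]

# Exact de-duplication, which is effectively what SELECT DISTINCT does:
# all three rows survive because none are byte-for-byte identical.
distinct_rows = sorted(set(rows))


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


THRESHOLD = 0.9  # hypothetical cut-off, chosen only for this illustration

# Compare every pair once and keep those above the threshold.
near_duplicates = [
    (a, b)
    for i, a in enumerate(rows)
    for b in rows[i + 1:]
    if similarity(a, b) >= THRESHOLD
]

print(len(distinct_rows))
print(near_duplicates)
```

Comparing every pair like this is quadratic in the number of rows, which is why blocking matters once the data gets large.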

0 Answers