
I'm using the Dedupe library to clean up some data. My understanding is that once the initial deduplication is done with a Dedupe object, we are supposed to use a Gazetteer object to match any new incoming data against the clustered data.

For the sake of explaining the issue, let's assume that:

  • The first batch of data is 500k rows of restaurants, with name, address, and phone number fields.
  • The second batch of data is, for instance, 1k new restaurants that did not exist at the time of the first batch, but that I now want to match against the first 500k.
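
Concretely, I represent both batches as dicts keyed by record ID, which is the shape dedupe expects; the IDs and field values below are made up:

```python
# Placeholder data, just to fix the shapes used in the sketches below.
restaurants = {
    "r-000001": {"name": "Chez Hugo", "address": "12 Rue Example", "phone": "0140000000"},
    # ... ~500k rows
}
new_restaurants = {
    "n-000001": {"name": "CHEZ HUGO", "address": "12 rue example", "phone": None},
    # ... ~1k rows
}
```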

If I describe the pipeline, it goes something like this (a rough code sketch follows the list):

  • Step 1) Initial deduplication
    • Train a Dedupe object on a sample of the 500k restaurants
    • Cluster the 500k rows with a Dedupe / StaticDedupe object
  • Step 2) Incremental deduplication
    • Train a Gazetteer object on a sample of the 500k restaurants vs 1k new restaurants
    • Match incoming 1k rows against 500k previous rows
    • Assign a canonical ID to each of the 1k rows that actually matched an existing restaurant
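
In dedupe 2.x-style code, what I have in mind is roughly this (`canonical` is a placeholder for one representative record per cluster from step 1, and the thresholds are arbitrary):

```python
import dedupe

# Field definitions shared by both steps (dict-style definitions;
# adjust the types to your data).
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "phone", "type": "String", "has missing": True},
]

# --- Step 1: initial deduplication of the 500k rows ---
deduper = dedupe.Dedupe(fields)
deduper.prepare_training(restaurants)  # samples candidate pairs from the 500k rows
dedupe.console_label(deduper)          # interactive labelling session
deduper.train()

# Persist the learned model and the labelled pairs.
with open("dedupe_settings", "wb") as f:
    deduper.write_settings(f)
with open("dedupe_training.json", "w") as f:
    deduper.write_training(f)

clusters = deduper.partition(restaurants, threshold=0.5)

# --- Step 2: incremental matching with a Gazetteer ---
gazetteer = dedupe.Gazetteer(fields)
gazetteer.prepare_training(new_restaurants, canonical)
dedupe.console_label(gazetteer)        # a second labelling session, which I'd like to avoid
gazetteer.train()

gazetteer.index(canonical)
matches = gazetteer.search(new_restaurants, threshold=0.5, n_matches=1)
```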

So, the questions are:

  • Is the pipeline actually correct?
  • Do I have to retrain the Gazetteer each time new data comes in?
    • Can't I reuse the blocking rules learned during the first step, or at least the same labelled pairs? (Assuming, of course, that the fields are the same and the data goes through exactly the same preprocessing.) See the sketch after this list.
  • I understand I could keep redoing step 1, but from what I read, that is not best practice.
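
On the reuse point, what I was hoping to do is feed the step 1 training file back in, something like the following. I don't know whether a training file written by a Dedupe object is valid input for a Gazetteer; that is part of the question (`new_restaurants` and `canonical` as above):

```python
import dedupe

fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
    {"field": "phone", "type": "String", "has missing": True},
]

gazetteer = dedupe.Gazetteer(fields)

# Reuse the pairs labelled during step 1 instead of labelling from scratch.
with open("dedupe_training.json") as f:
    gazetteer.prepare_training(new_restaurants, canonical, training_file=f)

gazetteer.train()
```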

@fgregg I went through all the Stack Overflow and GitHub issues (the most recent being this one), but could not find a helpful answer.

Thanks!

Hugo
