
I'm using dedupe python library.

Any code sample will do, for example this one.

Let's say I have a trained deduper and used it to deduplicate a dataset successfully.

Now I add one new row to the dataset.

I want to check if this new row is a duplicate or not.

Is there a way to do that in dedupe (without reclassifying the whole dataset)?

Update: I have tried @libreneitor's suggestion, but I just get `No records have been blocked together. Is the data you are trying to match like the data you trained on?` Here's my code (reading from a CSV file):

import csv
import exampleIO
import dedupe

def canonicalImport(filename):
    preProcess = exampleIO.preProcess
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for (i, row) in enumerate(reader):
            clean_row = {k: preProcess(v) for (k, v) in
                         row.items()}
            data_d[i] = clean_row
    return data_d, reader.fieldnames

raw_data = 'tests/datasets/restaurant-nophone-training.csv'

data_d, header = canonicalImport(raw_data)

training_pairs = dedupe.trainingDataDedupe(data_d, 'unique_id', 5000)

fields = [{'field': 'name', 'type': 'String'},
              {'field': 'name', 'type': 'Exact'},
              {'field': 'address', 'type': 'String'},
              {'field': 'cuisine', 'type': 'ShortString',
               'has missing': True},
              {'field': 'city', 'type': 'ShortString'}
              ]

deduper = dedupe.Gazetteer(fields, num_cores=5)
deduper.sample(data_d, 10000)
deduper.markPairs(training_pairs)
deduper.train(index_predicates=False)

alpha = deduper.threshold(data_d, 1)

data_d_test = {}
data_d_test[0] = data_d[0]
del data_d[0]

clustered_dupes = deduper.match(data_d, threshold=alpha)
clustered_dupes2 = deduper.match(data_d_test, threshold=alpha)  # <- exception here
rfg

1 Answer

You can match a new row against your existing Dedupe instance.

But if you have already produced a deduplicated dataset, you can use a Gazetteer to index the unique records and then call match again on just the new data.
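A minimal sketch of that flow, assuming the same dedupe 1.x camelCase API as the question's code; `gazetteer` stands for your already-trained `dedupe.Gazetteer`, and the record id and field values are placeholders:

```python
# Sketch only (dedupe 1.x API); assumes `gazetteer` is a trained
# dedupe.Gazetteer and `canonical_d` is the already-deduplicated
# dataset, keyed by record id.

# Build the search index over the clean records once.
gazetteer.index(canonical_d)

# A single new row, shaped like the records you trained on.
# The key and field values below are placeholders.
new_row = {9999: {'name': 'some name', 'address': 'some address',
                  'cuisine': 'some cuisine', 'city': 'some city'}}

# Match just the new row; no need to re-cluster the whole dataset.
matches = gazetteer.match(new_row, threshold=0.5)
print(matches)  # empty if the row matched nothing in the index
```

Note that the code in the question never calls `index()` before `match()`, which the Gazetteer needs in order to find candidate pairs; that may be why blocking produces no candidates and you see the "No records have been blocked together" error.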

hernan
  • When I match 1 new row I get `No records have been blocked together. Is the data you are trying to match like the data you trained on?` error, despite the fact that it's very similar to the data that I matched previously. – rfg Jun 18 '19 at 13:32
  • I get the same error with `Gazetteer`. I call `match` and pass the original dataset. After that I call `match` with just one row. I'm sure this one row is a duplicate, but I get this error. Am I calling it wrong? – rfg Jun 18 '19 at 13:39