I'm using the dedupe Python library.
Any code sample will do, for example this.
Let's say I have a trained deduper
and have used it to deduplicate a dataset successfully.
Now I add one new row to the dataset.
I want to check if this new row is a duplicate or not.
Is there a way to do that in dedupe (without reclassifying the whole dataset)?
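To make the goal concrete, here is a minimal sketch of the behaviour I'm after, written in plain Python with the standard library rather than with dedupe. Everything in it is an assumption made for illustration: the blocking rule (first three characters of the name), the 0.8 threshold, the helper names `block_key`, `build_index` and `find_duplicates`, and the sample records are all invented. The point is only that the new row is compared against records in its own block, not against the whole dataset.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(record):
    # Hypothetical blocking rule: first three characters of the name.
    return record["name"][:3].lower()


def build_index(records):
    """Group the ids of already-deduplicated records by blocking key."""
    index = defaultdict(list)
    for record_id, record in records.items():
        index[block_key(record)].append(record_id)
    return index


def as_text(record):
    # Flatten the fields being compared into one lowercase string.
    return (record["name"] + " " + record["address"]).lower()


def find_duplicates(new_record, records, index, threshold=0.8):
    """Score the new record only against records that share its block."""
    matches = []
    for record_id in index.get(block_key(new_record), []):
        score = SequenceMatcher(
            None, as_text(new_record), as_text(records[record_id])
        ).ratio()
        if score >= threshold:
            matches.append((record_id, score))
    # Best match first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Invented sample data, not taken from the restaurant dataset.
existing = {
    1: {"name": "Blue Door Cafe", "address": "12 Main St"},
    2: {"name": "Blue Moon Diner", "address": "98 Oak Ave"},
    3: {"name": "Golden Wok", "address": "7 Harbor Rd"},
}
index = build_index(existing)  # built once, reused for every new row

new_row = {"name": "Blue Door Cafe", "address": "12 Main Street"}
print(find_duplicates(new_row, existing, index))
```

The index is built once from the existing data, so checking a new row costs only a handful of comparisons. What I'm looking for is the dedupe equivalent of `find_duplicates`, using the model I have already trained.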
Update:
I have tried @libreneitor's suggestion, but I just get the error "No records have been blocked together. Is the data you are trying to match like the data you trained on?"
Here's my code (csv file):
import csv
import exampleIO
import dedupe
from future.utils import viewitems


def canonicalImport(filename):
    # Read the CSV into a dict of {row index: cleaned row}.
    preProcess = exampleIO.preProcess
    data_d = {}
    with open(filename) as f:
        reader = csv.DictReader(f)
        for (i, row) in enumerate(reader):
            clean_row = {k: preProcess(v) for (k, v) in
                         viewitems(row)}
            data_d[i] = clean_row
    return data_d, reader.fieldnames


raw_data = 'tests/datasets/restaurant-nophone-training.csv'
data_d, header = canonicalImport(raw_data)
training_pairs = dedupe.trainingDataDedupe(data_d, 'unique_id', 5000)

fields = [{'field': 'name', 'type': 'String'},
          {'field': 'name', 'type': 'Exact'},
          {'field': 'address', 'type': 'String'},
          {'field': 'cuisine', 'type': 'ShortString',
           'has missing': True},
          {'field': 'city', 'type': 'ShortString'}
          ]

deduper = dedupe.Gazetteer(fields, num_cores=5)
deduper.sample(data_d, 10000)
deduper.markPairs(training_pairs)
deduper.train(index_predicates=False)
alpha = deduper.threshold(data_d, 1)

# Hold out row 0 as the "new" row and remove it from the main dataset.
data_d_test = {}
data_d_test[0] = data_d[0]
del data_d[0]

clustered_dupes = deduper.match(data_d, threshold=alpha)
clustered_dupes2 = deduper.match(data_d_test, threshold=alpha)  # <- exception here