I am using the Dedupe Python package to check incoming records for duplicates. I trained on approximately 500,000 records from a CSV file and, using Dedupe, clustered those 500,000 records into different clusters. I am now trying to use the settings_file that came out of training to dedupe a single new record (data in the snippets below).
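For context, the training run that produced settings_file followed the standard Dedupe 1.x workflow, roughly like this (a sketch: the field list is abbreviated, and training_data stands for the keyed dict of the ~500,000 CSV rows):

import dedupe

# Abbreviated field definitions; the real list covers all columns.
fields = [
    {'field': 'Name', 'type': 'String'},
    {'field': 'col5', 'type': 'String'},
]

deduper = dedupe.Dedupe(fields)
deduper.sample(training_data, 15000)  # draw candidate pairs to label
dedupe.consoleLabel(deduper)          # interactive labelling of pairs
deduper.train()

with open(settings_file, 'wb') as sf:
    deduper.writeSettings(sf)         # the settings_file used below

threshold = deduper.threshold(training_data, recall_weight=1)
clustered_dupes = deduper.match(training_data, threshold)  # the clusters

The matching attempt for a new record then looks like this: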
import os

import dedupe
from unidecode import unidecode

settings_file = 'my_learned_settings'  # written out by the training run

deduper = None
if os.path.exists(settings_file):
    with open(settings_file, 'rb') as sf:
        deduper = dedupe.StaticDedupe(sf)

# Try to match the single new record at threshold 0.
clustered_dupes = deduper.match(data, 0)
Here, data is a single new record that I want to check for duplicates; it looks like this:
{1:{'SequenceID': 6855406, 'ApplicationID': 7065902, 'CustomerID': 6153222, 'Name': 'X', 'col1': '-42332423', 'col2': '0', 'col3': '0', 'col4': '0', 'col5': '24G0859681', 'col6': '0', 'col7': 'xyz12345', 'col8': 'xyz', 'col9': '1234', 'col10': 'xyz10'}}
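For completeness, the incoming values are normalized before matching, along the lines of the preProcess helper from the Dedupe CSV example (shown as a sketch; raw is an abbreviated version of the record above, and my actual cleaning step may differ slightly):

import re
from unidecode import unidecode

def preProcess(value):
    # Normalize a raw field value, as in the Dedupe CSV example.
    value = unidecode(str(value))
    value = re.sub('\n', ' ', value)
    value = value.strip().strip('"').strip("'").lower().strip()
    return value if value else None

raw = {'SequenceID': 6855406, 'Name': 'X', 'col5': '24G0859681'}  # abbreviated
data = {1: {k: preProcess(v) for k, v in raw.items()}}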
This call throws the following error:
No records have been blocked together. Is the data you are trying to match like the data you trained on?
How do I use this clustered data to check whether a new record is a duplicate or not? Is it possible to do this the way we would with any other ML model, i.e. train once and then score individual records? I have looked into multiple sources but have not found a solution to this problem: most of them talk about training, not about how to use the clustered data to check a single record. Is there another way out?
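The only workaround I can think of is to merge the new record into the full data set and re-run the matching, which seems far too heavy for a single lookup (a sketch; existing_records stands for the keyed dict of the ~500,000 records):

# Workaround idea: re-match the new record together with all existing records.
# This re-blocks the entire data set for every single lookup.
new_key = 'new-1'                  # any key not already in existing_records
combined = dict(existing_records)
combined[new_key] = data[1]

threshold = deduper.threshold(combined, recall_weight=1)
clustered_dupes = deduper.match(combined, threshold)

# The new record is a duplicate if it lands in a cluster with other records.
for record_ids, scores in clustered_dupes:
    if new_key in record_ids:
        print('duplicate of:', [r for r in record_ids if r != new_key])

Is there a cheaper way to score a single record against the existing clusters?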
Some links I have referred to: link1 link2 link3
Any help is appreciated.