Questions tagged [python-dedupe]

Questions about the dedupe Python library (a library for probabilistic deduplication and record linkage).

Dedupe is an open-source Python library for probabilistic deduplication, record linkage, and entity resolution.

67 questions
8
votes
1 answer

Dedupe in Python

While going through the examples for the Python Dedupe library, which is used for record deduplication, I found that it creates a Cluster Id column in the output file, which, according to the documentation, indicates which records refer to each…
Arnab
6
votes
2 answers

How do I link records to a large table efficiently using python Dedupe?

I'm trying to use the Dedupe package to merge a small, messy dataset into a canonical table. Since the canonical table is very large (122 million rows), I can't load it all into memory. The current approach that I'm using, based off this, takes an entire…
Luke
5
votes
2 answers

Low resources usage when using dedupe python

I need to find duplicates in a large dataset, so I'm testing the dedupe Python library. I know it is recommended for small datasets, so I thought using a good machine could improve the performance. I have a machine with 56 GB of RAM and I'm running a test…
mjimcua
4
votes
1 answer

dedupe in R equivalent

Is there a package in R equivalent to the dedupe library in Python? The reason is that I have used the 'Record Linkage' package in the past, but it seems to have a hard time with larger datasets. Dedupe seems to…
Rtab
3
votes
0 answers

Reusing Dedupe training for Gazetteer matching

I'm using the Dedupe library to clean up some data. However, once the first deduplication is done using the Dedupe object, I understand we are supposed to use the Gazetteer object to match any new incoming data against the clustered data. For the…
Hugo
3
votes
1 answer

Use python Dedupe package to check for single record

I am using the Dedupe Python package to check for duplicates among my incoming records. I have trained on approx. 500000 records from a CSV file. Using the Dedupe package, I have clustered the 500000 records into different clusters. I have attempted to use…
Eswar
3
votes
0 answers

How do I implement a custom comparator in the Python Dedupe library?

I'm using the so-far great Dedupe library to help link records from multiple providers. One of the fields I'm comparing is a phone number field. I'd like to use Google's phone number library to normalize these phone numbers. One other nice…
3
votes
1 answer

Dedupe one new row against existing dataset

I'm using the dedupe Python library. Any code sample will do, for example this. Let's say I have a trained deduper and used it to deduplicate a dataset successfully. Now I add one new row to the dataset. I want to check if this new row is a duplicate…
rfg
3
votes
3 answers

What is the most efficient way to dedupe a Pandas dataframe that has typos?

I have a dataframe of names and addresses that I need to dedupe. The catch is that some of these fields might have typos, even though they are still duplicates. For example, suppose I had this dataframe: index name zipcode ------- …
CJ Sullivan
3
votes
1 answer

Making Dedupe learn from existing label data

I am aware that Dedupe uses active learning to remove duplicates and perform record linkage. However, I would like to know if we can pass an Excel sheet with already matched pairs (labeled data) as the input for active learning.
2
votes
1 answer

Criss Cross Address Deduplication

I have a DB table, persons, in which only each person's details are captured, say Name, Father Name, Email, DOB, Proof of Address, Proof of Identity, Pincode, etc., and I have an address table in which the persons' addresses are stored, say Address, Pincode,…
2
votes
1 answer

String Similarity for all possible combination in Optimised fashion

I am facing a problem while finding string similarity. Scenario: the string consists of the following fields: first_name, middle_name and last_name. What I have to do is find the string similarity between A and B (both have the same fields) but making sure…
2
votes
0 answers

Dedupe library, Blocking issue, Missing matches

I have a CSV file with 3M rows and two columns, just an Arabic Student_name and Id. I want to cluster similar names that refer to the same student; the names may have spelling typos or extra spaces, for example. In the clustered output, there…
2
votes
2 answers

AttributeError: 'Dedupe' object has no attribute 'sample'

I was running csv_example.py from dedupe-examples and got the error message below: File "csv_example.py", line 111, in deduper.sample(data_d, 15000) AttributeError: 'Dedupe' object has no attribute 'sample' Any help would be…
2
votes
0 answers

How to send an input to running process in Java?

In Java 8, I start a process to run a Python file using Process and ProcessBuilder. While the process is running, it asks me for input. I see that ProcessBuilder has a redirectInput method; I tried it and it worked well. But the problem is the input…