De-duplication is the process of removing duplicated or redundant data from a database.
Questions tagged [deduplication]
139 questions
32
votes
1 answer
Remove duplicate documents from a search in Elasticsearch
I have an index in which many documents share the same value for a given field, and I want to deduplicate on that field.
Aggregations only give me counts; I would like a list of documents.
My index :
Doc 1 {domain: 'domain1.fr', name: 'name1',…

Bastien D
- 1,395
- 2
- 14
- 26
20
votes
3 answers
Java 8 String deduplication vs. String.intern()
I am reading about the feature in Java 8 update 20 for String deduplication (more info) but I am not sure if this basically makes String.intern() obsolete.
I know that this JVM feature needs the G1 garbage collector, which might not be an option…

Hilikus
- 9,954
- 14
- 65
- 118
12
votes
1 answer
Remove duplicates from list based on multiple fields or columns
I have a list of type MyClass
public class MyClass
{
public string prop1 { get; set; }
public int prop2 { get; set; }
public string prop3 { get; set; }
public int prop4 { get; set; }
public string prop5 { get; set; }
public string prop6 { get; set; }
....
}
This list will have…

user20358
- 14,182
- 36
- 114
- 186
12
votes
3 answers
sbt assembly error - deduplicate: different file contents found in the following
I get the following error when I run ./sbt assembly on my Scala project. I first saw it after adding these dependencies to my build.sbt. I can still compile and run my code.
libraryDependencies ++= Seq(
"org.scalanlp" % "breeze_2.10" % "0.7",
…

Soumya Simanta
- 11,523
- 24
- 106
- 161
10
votes
3 answers
What are some of the best hashing algorithms to use for data integrity and deduplication?
I'm trying to hash a large number of files with binary data inside of them in order to:
(1) check for corruption in the future, and
(2) eliminate duplicate files (which might have completely different names and other metadata).
I know about md5 and…

King Spook
- 381
- 4
- 10
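As a rough illustration of goal (2) in the question above, the following Python sketch groups files by a digest of their contents, so that names and other metadata are ignored. The choice of SHA-256 is an assumption made for this sketch (the question itself only mentions md5), and the names `file_digest` and `find_duplicates` are hypothetical. The same digests could also be stored and recomputed later to check for corruption, which is goal (1).

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map digest -> sorted list of paths, for digests shared by more than one file."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: sorted(paths) for digest, paths in groups.items() if len(paths) > 1}
```

Files in the same group have equal digests, which makes them duplicate candidates; a byte-by-byte comparison before deleting anything would rule out the (very unlikely) case of a collision.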
6
votes
2 answers
Java: a time-delayed queue that de-dupes
G'day everyone,
I have a system (the source) that needs to notify another system (the target) asynchronously whenever certain objects change. The twist is that the source system may mutate a single object many times in a short interval (updates are…

Peter
- 519
- 6
- 15
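One possible shape for the queue described in the question above, sketched in Python rather than Java for brevity: the first change to an object schedules a notification after a fixed delay, and further changes to the same object within that window are absorbed. The class name `DelayedDedupQueue`, its methods, and the explicit `now` parameter are all hypothetical choices for this sketch; passing the time in explicitly just keeps the example deterministic.

```python
class DelayedDedupQueue:
    """Coalesces repeated notifications for the same key within a delay window."""

    def __init__(self, delay):
        self.delay = delay
        # key -> time at which the key becomes ready; dicts preserve insertion order.
        self._due = {}

    def put(self, key, now):
        # The first notification fixes the due time; repeats before release are dropped.
        self._due.setdefault(key, now + self.delay)

    def pop_ready(self, now):
        """Remove and return all keys whose delay has elapsed, oldest first."""
        ready = [key for key, due in self._due.items() if due <= now]
        for key in ready:
            del self._due[key]
        return ready


queue = DelayedDedupQueue(delay=5)
queue.put("object-1", now=0)
queue.put("object-1", now=1)  # absorbed: object-1 is already pending
queue.put("object-2", now=2)
```

The sketch is single-threaded; a real source system would need locking or a concurrent structure around `put` and `pop_ready`.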
6
votes
3 answers
How to store bidirectional relationships
I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance.
However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation and wondered…

alj
- 2,839
- 5
- 27
- 37
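One common way to handle a symmetric "is a possible duplicate of" relationship like the one in the question above is to store each pair only once, in a canonical order (smaller id first), so that (A, B) and (B, A) are the same record. The Python sketch below shows the idea in memory; the class name `DuplicatePairs` and the use of integer customer ids are assumptions made for the example. In a relational table the same rule could be enforced by always writing the smaller id into the first column.

```python
class DuplicatePairs:
    """Stores unordered pairs of ids, each pair exactly once."""

    def __init__(self):
        self._pairs = set()

    @staticmethod
    def _key(a, b):
        if a == b:
            raise ValueError("a record cannot be a duplicate of itself")
        # Canonical order: smaller id first, so (a, b) and (b, a) share one key.
        return (a, b) if a < b else (b, a)

    def add(self, a, b):
        self._pairs.add(self._key(a, b))

    def contains(self, a, b):
        return self._key(a, b) in self._pairs

    def __len__(self):
        return len(self._pairs)


pairs = DuplicatePairs()
pairs.add(7, 3)
pairs.add(3, 7)  # same relationship, stored only once
```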
6
votes
3 answers
Email deduplication
Is it true that e-mails can be deduplicated using just some of their headers, since according to the RFC their Message-ID should be unique?
Is there any way to calculate the chance of a single email being missed by the deduplication method below (sha512…

Floris
- 299
- 3
- 17
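A minimal Python sketch of header-based deduplication along the lines of the question above. It keys each message on its Message-ID and, when that header is missing, falls back to a few other headers; the result is hashed with SHA-512. The fallback header list and the names `dedup_key` and `deduplicate` are assumptions for this sketch, not the method from the question, whose details are cut off.

```python
import email
import hashlib

# Hypothetical fallback headers, used only when Message-ID is absent.
FALLBACK_HEADERS = ("Date", "From", "To", "Subject")


def dedup_key(raw_message):
    """Return a SHA-512 hex digest identifying a message by its headers."""
    msg = email.message_from_string(raw_message)
    message_id = msg.get("Message-ID")
    if message_id and message_id.strip():
        basis = "id:" + message_id.strip()
    else:
        basis = "hdr:" + "\n".join(f"{h}:{msg.get(h, '')}" for h in FALLBACK_HEADERS)
    return hashlib.sha512(basis.encode("utf-8", errors="replace")).hexdigest()


def deduplicate(raw_messages):
    """Keep the first message seen for each key, preserving input order."""
    seen = set()
    unique = []
    for raw in raw_messages:
        key = dedup_key(raw)
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique
```

Note that this trusts senders to generate unique Message-IDs: two different messages that reuse an id would be collapsed into one, which is a separate risk from a hash collision.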
6
votes
3 answers
Deduping database records comparing values in numerous fields
So I'm trying to clean some phone records in a database table.
I've found out how to find exact matches in 2 fields using:
/* DUPLICATE first & last names */
SELECT
`First Name`,
`Last Name`,
COUNT(*) c
FROM phone.contacts
GROUP…

Still_Learning
- 63
- 6
5
votes
1 answer
Data deduplication framework?
I want to integrate data deduplication into software that I am writing to back up vmware images. I haven't been able to find anything suitable for what I think I need. There seem to be a LOT of complete solutions that include one form of…

stifin
- 1,390
- 3
- 18
- 28
5
votes
5 answers
Bad Performance for Dedupe of 2 million records using mapreduce on Appengine
I have about 2 million records, each with about 4 string fields, that need to be checked for duplicates. To be more specific, I have name, phone, address and fathername as fields, and I must check for duplicates using all these fields with rest of…

charming30
- 171
- 10
5
votes
4 answers
bash scripting de-dupe
I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to…

aidan
- 9,310
- 8
- 68
- 82
5
votes
4 answers
Deduplicate this java code duplication
I have about 10+ classes, and each one has a LUMP_INDEX and SIZE static constant.
I want an array of each of these classes, where the size of the array is calculated using those two constants.
At the moment I have a function for each class to create…

terryhau
- 549
- 2
- 9
- 18
5
votes
1 answer
Java Set with multiple equality criteria
I have a particular requirement where I need to dedupe a list of objects based on a combination of equality criteria.
e.g. Two Student objects are equal if:
1. firstName and id are the same OR 2. lastName, class, and emailId are the same
I was planning to…

Suraj Bajaj
- 6,630
- 5
- 34
- 49
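The two-rule equality in the question above cannot be expressed as a single hash key, because a record may match one earlier record under rule 1 and a different one under rule 2. One simple approach, sketched here in Python rather than Java for brevity, is to index every kept record under both keys and drop any later record that hits either index. The function name `dedupe_students` and the dict-based records are assumptions for this sketch; the field names are taken from the question.

```python
def dedupe_students(students):
    """Keep the first record of each group matched by rule 1 OR rule 2."""
    seen_keys = set()
    kept = []
    for s in students:
        keys = (
            ("rule1", s["firstName"], s["id"]),
            ("rule2", s["lastName"], s["class"], s["emailId"]),
        )
        # A record is a duplicate if either of its keys belongs to a kept record.
        if any(k in seen_keys for k in keys):
            continue
        # Only kept records contribute keys, so matching is against kept records,
        # not transitively through records that were already dropped.
        seen_keys.update(keys)
        kept.append(s)
    return kept
```

Because "equal under rule 1 OR rule 2" is not transitive, the result depends on input order; if transitive grouping is wanted, a union-find over the two keys would be the alternative.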
5
votes
2 answers
mysql efficient join of 2 tables to the same 2 tables
I have 2 tables that can be simplified to this structure:
Table 1:
+----+----------+---------------------+-------+
| id | descr_id | date | value |
+----+----------+---------------------+-------+
| 1 | 1 | 2013-09-20 16:39:06…

Eric Fitting
- 160
- 1
- 7