Questions tagged [deduplication]

De-duplication is the process of removing duplicated or redundant data from a database.

139 questions
32
votes
1 answer

Remove duplicate documents from a search in Elasticsearch

I have an index with many documents that share the same value for one field, and I want to deduplicate on that field. Aggregations only give me counts, but I would like a list of documents. My index: Doc 1 {domain: 'domain1.fr', name: 'name1',…
Bastien D
  • 1,395
  • 2
  • 14
  • 26
20
votes
3 answers

Java 8 String deduplication vs. String.intern()

I am reading about the feature in Java 8 update 20 for String deduplication (more info) but I am not sure if this basically makes String.intern() obsolete. I know that this JVM feature needs the G1 garbage collector, which might not be an option…
Hilikus
  • 9,954
  • 14
  • 65
  • 118
12
votes
1 answer

Remove duplicates from list based on multiple fields or columns

I have a list of type MyClass public class MyClass { public string prop1 {} public int prop2 {} public string prop3 {} public int prop4 {} public string prop5 {} public string prop6 {} .... } This list will have…
user20358
  • 14,182
  • 36
  • 114
  • 186
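For readers skimming this entry, one common way to deduplicate on several fields is to build a tuple key from those fields and keep the first item seen per key. The sketch below is in Python rather than the question's C#, and the `dedupe_by` helper, the choice of key fields, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class MyClass:
    # Only a few of the question's properties are modelled here.
    prop1: str
    prop2: int
    prop3: str
    prop4: int


def dedupe_by(items, key_fields):
    """Keep the first item seen for each combination of key_fields."""
    seen = set()
    unique = []
    for item in items:
        key = tuple(getattr(item, field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical sample data: the first two rows share prop1 and prop2.
rows = [
    MyClass("a", 1, "x", 10),
    MyClass("a", 1, "y", 20),
    MyClass("b", 1, "x", 30),
]
unique_rows = dedupe_by(rows, ["prop1", "prop2"])
```

Input order is preserved, so which duplicate survives depends on how the list is sorted beforehand.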
12
votes
3 answers

sbt assembly error - deduplicate: different file contents found in the following

I get the following error when I do a ./sbt assembly on my Scala project. I first saw it after adding these dependencies to my build.sbt. I can compile and run my code. libraryDependencies ++= Seq( "org.scalanlp" % "breeze_2.10" % "0.7", …
Soumya Simanta
  • 11,523
  • 24
  • 106
  • 161
10
votes
3 answers

What are some of the best hashing algorithms to use for data integrity and deduplication?

I'm trying to hash a large number of files with binary data inside of them in order to: (1) check for corruption in the future, and (2) eliminate duplicate files (which might have completely different names and other metadata). I know about md5 and…
King Spook
  • 381
  • 4
  • 10
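As a rough illustration of the approach this question describes, the sketch below hashes file contents and groups paths by digest, so files with different names but identical bytes end up together. SHA-256 is an assumption here, not a recommendation taken from the question, and `file_digest` and `find_duplicates` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root):
    """Map each digest to the files under root that share it (2+ files only)."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    return {d: paths for d, paths in by_digest.items() if len(paths) > 1}
```

Storing the digests alongside the files would also cover the corruption check: recompute later and compare against the stored value.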
6
votes
2 answers

Java: a time-delayed queue that de-dupes

G'day everyone, I have a system (the source) that needs to notify another system (the target) asynchronously whenever certain objects change. The twist is that the source system may mutate a single object many times in a short interval (updates are…
Peter
  • 519
  • 6
  • 15
6
votes
3 answers

How to store bidirectional relationships

I am writing some code to find duplicate customer details in a database. I'll be using Levenshtein distance. However, I am not sure how to store the relationships. I use databases all the time but have never come across this situation and wondered…
alj
  • 2,839
  • 5
  • 27
  • 37
6
votes
3 answers

Email deduplication

Is it true that e-mail can be deduplicated using just some of its headers, since according to the RFC the Message-ID should be unique? Is there any way to calculate the chance of a single email being missed by the deduplication method below (sha512…
Floris
  • 299
  • 3
  • 17
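A minimal sketch of header-based deduplication, assuming the Message-ID header alone is used as the key (the question's own method involves sha512 and is truncated here, so this is not a reconstruction of it). Uniqueness of Message-ID depends on the sending software, so messages without the header are kept rather than guessed at. `dedupe_by_message_id` is a hypothetical helper name.

```python
from email import message_from_string


def dedupe_by_message_id(raw_messages):
    """Keep the first raw message seen for each Message-ID header value."""
    seen = set()
    unique = []
    for raw in raw_messages:
        message_id = message_from_string(raw).get("Message-ID")
        if message_id is None:
            # No header to match on: keep the message rather than drop it.
            unique.append(raw)
            continue
        key = message_id.strip()
        if key not in seen:
            seen.add(key)
            unique.append(raw)
    return unique


# Hypothetical sample messages; the first two share a Message-ID.
samples = [
    "Message-ID: <1@example.com>\nSubject: hello\n\nfirst copy\n",
    "Message-ID: <1@example.com>\nSubject: hello\n\nsecond copy\n",
    "Message-ID: <2@example.com>\nSubject: other\n\nbody\n",
]
unique_messages = dedupe_by_message_id(samples)
```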
6
votes
3 answers

Deduping database records comparing values in numerous fields

So I'm trying to clean some phone records in a database table. I've found out how to find exact matches in 2 fields using: /* DUPLICATE first & last names */ SELECT `First Name`, `Last Name`, COUNT(*) c FROM phone.contacts GROUP…
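The general shape of such a query is a GROUP BY over the compared fields with a HAVING filter on the count. The sketch below runs that pattern against an in-memory SQLite table from Python. It is not a completion of the truncated MySQL query above: the `phone` column and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE contacts ("First Name" TEXT, "Last Name" TEXT, phone TEXT)'
)
# Hypothetical sample rows: Ann Lee appears twice.
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ann", "Lee", "555-0100"),
        ("Ann", "Lee", "555-0101"),
        ("Bob", "Ray", "555-0102"),
    ],
)

# Name pairs that occur more than once, with their counts.
duplicates = conn.execute(
    'SELECT "First Name", "Last Name", COUNT(*) AS c '
    "FROM contacts "
    'GROUP BY "First Name", "Last Name" '
    "HAVING COUNT(*) > 1"
).fetchall()
```

Comparing more fields means adding them to both the SELECT list and the GROUP BY clause.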
5
votes
1 answer

Data deduplication framework?

I want to integrate data deduplication into software that I am writing to back up vmware images. I haven't been able to find anything suitable for what I think I need. There seem to be a LOT of complete solutions that include one form of…
stifin
  • 1,390
  • 3
  • 18
  • 28
5
votes
5 answers

Bad Performance for Dedupe of 2 million records using mapreduce on Appengine

I have about 2 million records, each with about 4 string fields that need to be checked for duplicates. To be more specific I have name, phone, address and fathername as fields and I must check for dedupe using all these fields with rest of…
5
votes
4 answers

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff. This file doesn't change very frequently though, so I want to…
aidan
  • 9,310
  • 8
  • 68
  • 82
5
votes
4 answers

Deduplicate this java code duplication

I have about 10+ classes, and each one has a LUMP_INDEX and SIZE static constant. I want an array of each of these classes, where the size of the array is calculated using those two constants. At the moment I have a function for each class to create…
terryhau
  • 549
  • 2
  • 9
  • 18
5
votes
1 answer

Java Set with multiple equality criteria

I have a particular requirement where I need to dedupe a list of objects based on a combination of equality criteria. e.g. Two Student objects are equal if: 1. firstName and id are same OR 2. lastName, class, and emailId are same I was planning to…
Suraj Bajaj
  • 6,630
  • 5
  • 34
  • 49
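An "equal if A or B" rule is not transitive, so it does not map cleanly onto a single equals/hash pair. One possible workaround, sketched here in Python rather than Java, is a greedy pass that tracks one key set per criterion and drops an item when either key has already been recorded by a kept item. The field names and the `dedupe_students` helper are assumptions, and this does not merge chains of indirectly related records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    first_name: str
    last_name: str
    id: int
    class_name: str
    email_id: str


def dedupe_students(students):
    """Drop a student matching a kept one on either equality criterion."""
    seen_name_id = set()   # criterion 1: firstName and id
    seen_contact = set()   # criterion 2: lastName, class and emailId
    unique = []
    for s in students:
        key1 = (s.first_name, s.id)
        key2 = (s.last_name, s.class_name, s.email_id)
        if key1 in seen_name_id or key2 in seen_contact:
            continue
        seen_name_id.add(key1)
        seen_contact.add(key2)
        unique.append(s)
    return unique


# Hypothetical sample data.
students = [
    Student("Ann", "Lee", 1, "5A", "ann@example.com"),
    Student("Ann", "Ray", 1, "5B", "ann2@example.com"),   # matches on criterion 1
    Student("Amy", "Lee", 2, "5A", "ann@example.com"),    # matches on criterion 2
    Student("Bob", "Kim", 3, "5A", "bob@example.com"),
]
unique_students = dedupe_students(students)
```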
5
votes
2 answers

mysql efficient join of 2 tables to the same 2 tables

I have 2 tables that can be simplified to this structure: Table 1: +----+----------+---------------------+-------+ | id | descr_id | date | value | +----+----------+---------------------+-------+ | 1 | 1 | 2013-09-20 16:39:06…
Eric Fitting
  • 160
  • 1
  • 7