12

I have an index with multiple duplicate entries. They have different ids but the other fields have identical content.

For example:

{id: 1, content: 'content1'}
{id: 2, content: 'content1'}
{id: 3, content: 'content2'}
{id: 4, content: 'content2'}

After removing the duplicates:

{id: 1, content: 'content1'}
{id: 3, content: 'content2'}

Is there a way to delete all duplicates and keep only one distinct entry without manually comparing all entries?

fwind
  • Use your own ids, which ensure idempotence. This means that with content: "content1", you should always have the same id – Julien C. Jun 01 '15 at 13:32
  • But that is not the case for me. I am working with a given index that contains multiple separate entries holding the same content. Therefore I want to remove these duplicates. – fwind Jun 01 '15 at 13:36
  • How is your `content` field mapped? Is it a `string`, `analyzed` or `not_analyzed`? – Val Jun 02 '15 at 03:06
  • You can create another index with content being the id. Then migrate your existing index to the new index by means of either snapshot/restore or scan and scroll – isaac.hazan Jun 02 '15 at 08:44
  • What is causing duplicate entries in the first place? – jflay Jun 02 '15 at 18:46
  • @jflay: There are duplicate entries in the data dump which I import – fwind Jun 03 '15 at 09:40
  • @Val: My content field is analyzed – fwind Jun 03 '15 at 09:40

3 Answers

6

This can be accomplished in several ways. Below I outline two possible approaches:

1) If you don't mind generating new _id values and reindexing all of the documents into a new index, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the _id for documents as they are written into the new index. Since the _id field must be unique, any documents that have the same fingerprint will be written to the same _id and therefore deduplicated.
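
As a rough illustration of the same idea without Logstash, here is a minimal Python sketch (the index names source_index and deduped_index are placeholders; content is the field from your example) that reindexes every document while deriving its _id from a hash of the content field, so duplicates collapse onto a single document:

import hashlib
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

def reindex_with_fingerprint_ids(source_index='source_index', dest_index='deduped_index'):
    # Derive each new _id from a hash of the `content` field, so documents
    # with identical content overwrite each other in the destination index.
    actions = (
        {
            '_index': dest_index,
            '_id': hashlib.sha256(hit['_source']['content'].encode('utf-8')).hexdigest(),
            '_source': hit['_source'],
        }
        for hit in helpers.scan(es, index=source_index)
    )
    helpers.bulk(es, actions)

If more than one field defines uniqueness, concatenate them before hashing, which is essentially what the fingerprint filter's concatenate_sources option does for you.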

2) You can write a custom script that scrolls over your index. As each document is read, you can create a hash from the fields that you consider to define a unique document (in your case, the content field). Then use this hash as the key in a dictionary (aka hash table). The value associated with this key would be a list of all of the documents' _ids that generate this same hash. Once you have all of the hashes and associated lists of _ids, you can execute a delete operation on all but one of the _ids that are associated with each identical hash. Note that this second approach does not require writing documents to a new index in order to de-duplicate, as you would delete documents directly from the original index.
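
A condensed sketch of this second approach in Python (my_index is a placeholder index name) groups the _ids by hash and then removes everything but the first _id in each group with a single bulk request:

import hashlib
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

def deduplicate_in_place(index_name='my_index', field='content'):
    # Group document _ids by a hash of the field that defines uniqueness.
    ids_by_hash = {}
    for hit in helpers.scan(es, index=index_name):
        digest = hashlib.sha256(hit['_source'][field].encode('utf-8')).hexdigest()
        ids_by_hash.setdefault(digest, []).append(hit['_id'])

    # Keep the first _id per hash and bulk-delete the rest.
    delete_actions = (
        {'_op_type': 'delete', '_index': index_name, '_id': doc_id}
        for ids in ids_by_hash.values()
        for doc_id in ids[1:]
    )
    helpers.bulk(es, delete_actions)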

I have written a blog post and code that demonstrate both of these approaches at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/

Disclaimer: I am a Consulting Engineer at Elastic.

Alexander Marquardt
2

I use Rails, and if necessary I will import things with the FORCE=y command, which removes and re-indexes everything for that index and type... however, I'm not sure what environment you are running ES in. The only issue I can see is if the data source you are importing from (i.e. a database) has duplicate records. I guess I would first see whether the data source can be fixed, if that is feasible, and then re-index everything; otherwise you could try to create a custom import method that only indexes one of the duplicate items for each record.
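
If you go the custom-import route, a rough sketch of that filter (in Python rather than Rails, and assuming the dump is a list of dicts with id and content keys and my_index is the target index) could skip any record whose content has already been seen:

import hashlib
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

def import_without_duplicates(records, index_name='my_index'):
    # Index each record only the first time its content hash is seen.
    seen = set()
    actions = []
    for record in records:
        digest = hashlib.sha256(record['content'].encode('utf-8')).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        actions.append({'_index': index_name, '_id': record['id'], '_source': record})
    helpers.bulk(es, actions)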

Furthermore, and I know this doesn't comply with your wanting to remove duplicate entries, but you could simply customize your search so that it only returns one of the duplicate ids, either by most recent "timestamp" or by indexing deduplicated data and grouping by your content field -- see if this post helps. Even though this would still retain the duplicate records in your index, at least they won't come up in the search results.
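
For the grouping idea, something along these lines (a terms aggregation with a top_hits sub-aggregation, shown here in Python) returns one hit per distinct content value; it assumes a not_analyzed sub-field such as content.raw exists to aggregate on, which the question doesn't mention, so adjust the field name to your mapping:

from elasticsearch import Elasticsearch

es = Elasticsearch(['localhost:9200'])

# One bucket per distinct content value, keeping a single hit per bucket.
# `content.raw` is an assumed not_analyzed sub-field; adapt it to your mapping.
body = {
    "size": 0,
    "aggs": {
        "dedup_by_content": {
            "terms": {"field": "content.raw", "size": 1000},
            "aggs": {
                "first_doc": {"top_hits": {"size": 1}}
            }
        }
    }
}

res = es.search(index='my_index', body=body)
for bucket in res['aggregations']['dedup_by_content']['buckets']:
    hit = bucket['first_doc']['hits']['hits'][0]
    print(hit['_id'], hit['_source'])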

I also found this: Elasticsearch delete duplicates

I tried to think of several possible scenarios for you, to see whether any of these options work or could at least be a temporary fix.

jflay
0

Here is a script I created based on Alexander Marquardt's answer.

import hashlib
from elasticsearch import Elasticsearch, helpers

ES_HOST = 'localhost:9200'
es = Elasticsearch([ES_HOST])


def scroll_over_all_docs(index_name='squad_docs'):
    """Scroll over the whole index and group document _ids by a hash of their `text` field."""
    dict_of_duplicate_docs = {}

    # Total document count, used only for progress reporting.
    index_docs_count = es.cat.count(index=index_name, params={"format": "json"})
    total_docs = int(index_docs_count[0]['count'])
    count = 0

    for hit in helpers.scan(es, index=index_name):
        count += 1

        # Hash the field that defines a "unique" document (here: `text`).
        text = hit['_source']['text']
        doc_id = hit['_id']
        hashed_text = hashlib.md5(text.encode('utf-8')).digest()

        # Collect all _ids that share the same hash.
        dict_of_duplicate_docs.setdefault(hashed_text, []).append(doc_id)

        if count % 100 == 0:
            print(f'Progress: {count} / {total_docs}')

    return dict_of_duplicate_docs


def delete_duplicates(duplicates, index_name='squad_docs'):
    """Keep the first _id of each group and delete the rest."""
    for hashed_text, ids in duplicates.items():

        if len(ids) > 1:
            print(f'Number of docs: {len(ids)}. Number of docs to delete: {len(ids) - 1}')

            # Delete every document in the group except the first one.
            for doc_id in ids[1:]:
                res = es.delete(index=index_name, doc_type='_doc', id=doc_id)
                id_deleted = res['_id']
                result = res['result']
                print(f'Document id {id_deleted} status: {result}')

            # Fetch and show the single document that was kept.
            remaining_doc = es.get(index=index_name, doc_type='_doc', id=ids[0])
            print('Remaining document:')
            print(remaining_doc)


def main():
    dict_of_duplicate_docs = scroll_over_all_docs()
    delete_duplicates(dict_of_duplicate_docs)


if __name__ == "__main__":
    main()
eboraks