Elasticsearch delete duplicates

Question

Some of the records in my index are duplicated. Duplicates are identified by a numeric field, recordid.

There is delete-by-query in Elasticsearch. Can I use it to delete one of the duplicate records?

Or is there some other way to achieve this?

Tombart · Answer 1 · 2017-06-28T08:26:40.940

5

Yes, you can find duplicated documents with an aggregation query:

curl -XPOST http://localhost:9200/your_index/_search -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'

Then delete the duplicated documents, preferably using a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
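As a rough illustration of the "then delete" step, here is a minimal Python sketch that turns the response of the aggregation above into a Bulk API body. It is not taken from es-deduplicator; the function name bulk_delete_body is made up for this example, and it assumes you want to keep the first hit of every bucket and delete the rest. Note that with "size": 10 in both the terms and top_hits parts, one pass only covers up to 10 recordid values and 10 documents per value, so you may need to repeat the search and delete cycle.

```python
import json


def bulk_delete_body(response, agg_name="duplicateCount", hits_name="duplicateDocuments"):
    """Build a Bulk API body deleting every hit except the first one in each bucket.

    `response` is the parsed JSON returned by the aggregation query; the
    aggregation names default to the ones used in that query.
    """
    lines = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        # Keep hits[0] as the surviving copy; mark the others for deletion.
        for hit in hits[1:]:
            meta = {"_index": hit["_index"], "_id": hit["_id"]}
            if "_type" in hit:  # only present on versions that still have mapping types
                meta["_type"] = hit["_type"]
            lines.append(json.dumps({"delete": meta}))
    # The Bulk API expects newline-delimited JSON ending with a newline.
    return "".join(line + "\n" for line in lines)


# Hand-written sample response (not real cluster output).
sample = {"aggregations": {"duplicateCount": {"buckets": [
    {"key": 42, "doc_count": 3, "duplicateDocuments": {"hits": {"hits": [
        {"_index": "your_index", "_id": "a"},
        {"_index": "your_index", "_id": "b"},
        {"_index": "your_index", "_id": "c"},
    ]}}},
]}}}

print(bulk_delete_body(sample), end="")
```

The resulting text can be sent as the request body to the _bulk endpoint. Which copy survives is arbitrary here; if you care (for example, keeping the newest), you would need to add a sort to top_hits.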

NOTE: Aggregation queries can be very expensive and might crash your nodes (if your index is too large and the number of data nodes is too small).

edited Jun 28 '17 at 08:26

answered Mar 28 '17 at 15:31


I am getting an out-of-memory error. Can we add a date range, to get duplicates only from a particular date range? – Jeeva N Jun 28 '17 at 07:16
@JeevaN Yes, we can try that, though I'm not sure if it will help with really large indexes. Feel free to submit a PR. What is your configuration (index size, number of master-eligible nodes and number of data nodes)? Do you split your indexes, e.g. by day? – Tombart Jun 28 '17 at 08:36
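
For what it's worth, restricting the aggregation to a date range could look roughly like the sketch below, which builds the same request body with a range query added. The date field name "timestamp" and the function name duplicates_query are assumptions for illustration; nothing in the thread says what the index actually contains. Whether this avoids the out-of-memory error is not guaranteed, and duplicates whose copies fall in different date ranges will not be detected.

```python
import json


def duplicates_query(key_field, date_field, gte, lt, max_keys=10, docs_per_key=10):
    """Return a search body that looks for duplicates only within [gte, lt)."""
    return {
        "size": 0,
        # Only documents inside the date window are fed to the aggregation.
        "query": {"range": {date_field: {"gte": gte, "lt": lt}}},
        "aggs": {
            "duplicateCount": {
                "terms": {"field": key_field, "min_doc_count": 2, "size": max_keys},
                "aggs": {"duplicateDocuments": {"top_hits": {"size": docs_per_key}}},
            }
        },
    }


# "timestamp" is a hypothetical field name; replace it with your own date field.
body = duplicates_query("recordid", "timestamp", "2017-06-01", "2017-07-01")
print(json.dumps(body, indent=2))
```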

Andy · Answer 2 · 2015-07-14T03:06:41.410

1

Elasticsearch recommends "use(ing) the scroll/scan API to find all matching ids and then issue a bulk request to delete them".
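
A minimal sketch of the client-side part of that approach, assuming the hits come from a scroll over the index and that each hit carries the recordid in its _source. The function name duplicates_to_delete is made up for this example, and the scrolling itself is left out; in the official Python client, helpers such as elasticsearch.helpers.scan and elasticsearch.helpers.bulk are the usual candidates for fetching the hits and sending the deletes.

```python
def duplicates_to_delete(hits, key_field="recordid"):
    """Yield every hit whose key_field value was already seen.

    The first document seen for each value is kept; later ones are
    reported as duplicates. `hits` is any iterable of search hits.
    """
    seen = set()
    for hit in hits:
        key = hit["_source"][key_field]
        if key in seen:
            yield hit
        else:
            seen.add(key)


# Hand-written hits standing in for what a scroll would return.
hits = [
    {"_index": "your_index", "_id": "1", "_source": {"recordid": 7}},
    {"_index": "your_index", "_id": "2", "_source": {"recordid": 7}},
    {"_index": "your_index", "_id": "3", "_source": {"recordid": 8}},
]

for hit in duplicates_to_delete(hits):
    print(hit["_id"])
```

This moves the memory cost from the cluster to the client: the set grows with the number of distinct recordid values, so it only fits in memory if that number is manageable.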

Edited

edited Jul 14 '15 at 03:06

answered Nov 10 '14 at 23:00


You can't use [size] when using the delete_by_query method – Trent Jul 13 '15 at 19:43
@Trent good call. Updated with the current recommendation for doing large deletes. – Andy Jul 14 '15 at 03:07

score 1 · Answer 3 · answered Dec 10 '15 at 00:59

The first challenge here is to identify the duplicate documents. For that, you need to run a terms aggregation on the field(s) that define the uniqueness of the document. On the second level of aggregation, use top_hits to get the document IDs too. Once that is done, you will have the IDs of the documents that have duplicates.

Now you can safely remove them, perhaps using the Bulk API.

You can read about other approaches to detecting and removing duplicate documents here.
