
Some of the records in my index are duplicated; the duplicates are identified by a numeric field, recordid.

Elasticsearch has delete-by-query. Can I use it to delete just one of each pair of duplicate records?

Or is there some other way to achieve this?

Tombart
FUD

3 Answers


Yes, you can find duplicated documents with an aggregation query:

curl -XPOST http://localhost:9200/your_index/_search -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "recordid",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}'

then delete the duplicated documents, preferably using a bulk request. Have a look at es-deduplicator for automated duplicate removal (disclaimer: I'm the author of that script).
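The "aggregate, then bulk-delete" step can be sketched in Python. This is only a sketch: `duplicate_ids` is a hypothetical helper name, and the response shape assumed here is the standard `terms`/`top_hits` aggregation format matching the query above (with `duplicateCount` and `duplicateDocuments` as the aggregation names):

```python
def duplicate_ids(agg_response):
    """Collect the _id of every redundant copy, treating the first hit
    in each terms bucket as the surviving document."""
    ids = []
    buckets = agg_response["aggregations"]["duplicateCount"]["buckets"]
    for bucket in buckets:
        hits = bucket["duplicateDocuments"]["hits"]["hits"]
        # hits[0] is kept; everything after it is a duplicate to delete
        ids.extend(hit["_id"] for hit in hits[1:])
    return ids
```

The returned IDs can then be fed into a single `_bulk` delete request instead of one delete per document.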

NOTE: Aggregation queries can be very expensive and might even crash your nodes (if your index is too large and the number of data nodes too small).

Tombart
  • I am getting an out-of-memory error; can we add a date range to get duplicates from a particular date range? – Jeeva N Jun 28 '17 at 07:16
  • @JeevaN Yes, we can try that, though I'm not sure if it will help with really large indexes. Feel free to submit a PR. What is your configuration (index size, number of master-eligible nodes and number of data nodes)? Do you split your indexes e.g. by day? – Tombart Jun 28 '17 at 08:36

Elasticsearch recommends using "the scroll/scan API to find all matching ids and then issue a bulk request to delete them".
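The scroll-based approach boils down to: stream every document, keep the first one seen per recordid, and mark every later copy for deletion. A minimal sketch of that bookkeeping, assuming hits in the shape the search/scroll APIs return (`_id` plus a `_source` containing `recordid`; the function name is hypothetical):

```python
def ids_to_delete(scrolled_hits, key="recordid"):
    """Given hits streamed from the scroll API, keep the first document
    seen for each key value and return the _ids of all later copies."""
    seen = set()
    to_delete = []
    for hit in scrolled_hits:
        value = hit["_source"][key]
        if value in seen:
            to_delete.append(hit["_id"])  # duplicate: schedule for bulk delete
        else:
            seen.add(value)               # first occurrence survives
    return to_delete
```

Unlike the aggregation approach, this visits every document exactly once, at the cost of holding the set of seen key values in memory.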


Andy

The first challenge here is to identify the duplicate documents. For that, run a terms aggregation on the fields that define the uniqueness of a document. On the second level of the aggregation, use top_hits to get the document ID as well. Once you are there, you will have the IDs of the documents that have duplicates.

Now you can safely remove them, perhaps using the Bulk API.
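For the removal step, the Bulk API expects a newline-delimited JSON body with one action line per document. A sketch of building that body (the function name is hypothetical; the action format shown, `_index` and `_id` without `_type`, matches newer Elasticsearch versions):

```python
import json

def bulk_delete_body(index, doc_ids):
    """Build the newline-delimited JSON body for a _bulk request that
    deletes the given document IDs from the given index."""
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}})
             for doc_id in doc_ids]
    # The _bulk endpoint requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

The resulting string would be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`.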

You can read about other approaches to detecting and removing duplicate documents here.

Vineeth Mohan