
I have an index with many documents that share the same value for a given field. I want to deduplicate the results on this field.

Aggregations only give me counts. I would like a list of documents instead.

My index:

  • Doc 1 {domain: 'domain1.fr', name: 'name1', date: '01-01-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
  • Doc 3 {domain: 'domain2.fr', name: 'name2', date: '01-03-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 5 {domain: 'domain3.fr', name: 'name3', date: '01-05-2014'}
  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}

I want this result (deduplicated on the domain field):

  • Doc 6 {domain: 'domain3.fr', name: 'name3', date: '01-06-2014'}
  • Doc 4 {domain: 'domain2.fr', name: 'name2', date: '01-04-2014'}
  • Doc 2 {domain: 'domain1.fr', name: 'name1', date: '01-02-2014'}
Dan Tuffery
Bastien D

1 Answer


You could use field collapsing: group the results on the name field with a terms aggregation, and set the size of the nested top_hits aggregation to 1 so that each bucket returns a single document.

POST http://localhost:9200/test/dedup/_search?search_type=count&pretty=true
{
  "aggs": {
    "dedup": {
      "terms": {
        "field": "name"
      },
      "aggs": {
        "dedup_docs": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

This returns:

{
  "took" : 192,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "dedup" : {
      "buckets" : [ {
        "key" : "name1",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "1",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name1", date: "01-01-2014"}
            } ]
          }
        }
      }, {
        "key" : "name2",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "3",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name2", date: "01-03-2014"}
            } ]
          }
        }
      }, {
        "key" : "name3",
        "doc_count" : 2,
        "dedup_docs" : {
          "hits" : {
            "total" : 2,
            "max_score" : 1.0,
            "hits" : [ {
              "_index" : "test",
              "_type" : "dedup",
              "_id" : "5",
              "_score" : 1.0,
              "_source":{domain: "domain1.fr", name: "name3", date: "01-05-2014"}
            } ]
          }
        }
      } ]
    }
  }
}
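
Since the documents come back nested inside the aggregation buckets rather than in the top-level hits, the client has to flatten them. A minimal Python sketch of that step, assuming the response has already been parsed into a dict; the function name `extract_deduplicated_docs` and the trimmed sample are illustrative, not part of any Elasticsearch client:

```python
def extract_deduplicated_docs(response, agg_name="dedup", hits_name="dedup_docs"):
    """Return the first top_hits _source of every bucket in a terms aggregation."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    docs = []
    for bucket in buckets:
        hits = bucket.get(hits_name, {}).get("hits", {}).get("hits", [])
        if hits:  # skip buckets that carry no hit
            docs.append(hits[0]["_source"])
    return docs


# Trimmed sample shaped like the response above (two buckets only).
sample_response = {
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "name1",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "1", "_source": {"domain": "domain1.fr", "name": "name1", "date": "01-01-2014"}}
                    ]}},
                },
                {
                    "key": "name2",
                    "doc_count": 2,
                    "dedup_docs": {"hits": {"total": 2, "hits": [
                        {"_id": "3", "_source": {"domain": "domain1.fr", "name": "name2", "date": "01-03-2014"}}
                    ]}},
                },
            ]
        }
    }
}

for doc in extract_deduplicated_docs(sample_response):
    print(doc)
```

The aggregation names are passed as parameters so the same helper works if the request uses names other than `dedup` and `dedup_docs`.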
WillS
Dan Tuffery
  • However, if my field value is something like 'http://www.eyrolles.com/Loisirs/Livre/couture-printemps-ete-9782756522081', the terms of my buckets are 'printemps', 'couture', '9782756522081'... The terms aggregation splits the URL into words. I don't want the value to be split. – Bastien D Aug 28 '14 at 08:51
  • That is a different question. You would need to index the field as `not_analyzed` and reference that field in your aggregation instead. Have a look at multi-field types: http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/mapping-multi-field-type.html – Dan Tuffery Aug 28 '14 at 09:31
  • Is there a way to decide which one among the duplicates ES will choose? Say I have documents that I want to collapse on field1, but those documents have different field2 values, and I want to be able to arbitrarily choose which one. If it helps, in my specific case I want to choose the last one inserted. – coffeeaddict May 28 '15 at 17:04
  • Where did you remove the doc? – Thomas Decaux Oct 21 '16 at 10:37
  • Can we add a date condition, to get the duplicates for a particular date range? – Jeeva N Jun 28 '17 at 07:15
  • How can I get the total count of distinct records while doing the aggregation, so that we can build pagination on the client side? – Karunaker Reddy V Aug 18 '17 at 09:23
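
On the question raised in the comments about which duplicate is kept: the top_hits aggregation accepts a `sort`, so the document returned per bucket can be chosen by a field such as the date. A minimal Python sketch that only builds the request body; the helper name `build_dedup_query` is hypothetical, and it assumes `domain` is indexed as `not_analyzed` and `date` is mapped as a date type (plain 'dd-MM-yyyy' strings would not sort chronologically):

```python
import json


def build_dedup_query(group_field, sort_field=None, order="desc", max_groups=10):
    """Build a terms + top_hits body that returns one document per group_field value."""
    top_hits = {"size": 1}
    if sort_field is not None:
        # Assumption: sort_field is mapped so that sorting gives the intended order.
        top_hits["sort"] = [{sort_field: {"order": order}}]
    return {
        "aggs": {
            "dedup": {
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {"dedup_docs": {"top_hits": top_hits}},
            }
        }
    }


# Deduplicate on domain, keeping the most recent document of each group.
print(json.dumps(build_dedup_query("domain", sort_field="date"), indent=2))
```

The printed JSON can be sent as the body of the same `_search` request shown in the answer.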