I have the problem that some documents are indexed twice or more, so I want to filter out these duplicates when searching. Following some other threads, I built this query:

{
  "query" : { ... },
  "size" : 10,
  "from" : 0,
  "sort" : { ... },
  "aggs" : {
    "dedup" : {
      "terms" : {
        "field" : "content.keyword"
      },
      "aggs" : {
        "dedup_docs" : {
          "top_hits" : {
            "size" : 1
          }
        }
      }
    }
  }
}

But this aggregation seems to have no effect: I still get duplicate results (documents with the same text in the content field).

Edit: I changed the request to use field collapsing:

{
  "query" : { ... },
  "size" : 10,
  "from" : 0,
  "sort" : { ... },
  "collapse" : {
    "field" : "content.keyword"
  }
}
altralaser
  • Are you trying to filter out duplicate **aggregations** or duplicate document results? – aclowkay Jul 06 '17 at 07:28
  • I want to filter out duplicate document results, let's say documents with the same title or the same text content. – altralaser Jul 06 '17 at 07:49
  • Are you using the answer from here https://stackoverflow.com/questions/25448186/remove-duplicate-documents-from-a-search-in-elasticsearch ? Did you run on the right endpoint? /_search?search_type=count Did you look for the results in the aggregations and **not** in the _hits_ array? – aclowkay Jul 06 '17 at 08:11
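The point in the last comment can be made concrete: with the terms/top_hits request from the question, the top-level hits array is unchanged, and the de-duplicated documents sit under the aggregations key of the response. Below is a minimal Python sketch of reading them out. The helper name dedup_docs_from_aggs and the sample response are hypothetical and written by hand for illustration; only the aggregation names "dedup" and "dedup_docs" come from the question.

```python
def dedup_docs_from_aggs(response, agg_name="dedup", top_hits_name="dedup_docs"):
    """Return one representative hit per terms bucket of a search response.

    `response` is the parsed JSON body of a search response (a plain dict).
    The aggregation names default to the ones used in the question.
    """
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    docs = []
    for bucket in buckets:
        # Each bucket carries its own top_hits result; with "size": 1 there
        # is at most one hit per distinct field value.
        hits = bucket[top_hits_name]["hits"]["hits"]
        if hits:
            docs.append(hits[0])
    return docs


# Hand-written sample response (illustrative data, not real output):
# three matching documents, two of which share the same content.
sample_response = {
    "hits": {
        "total": 3,
        "hits": [
            {"_id": "1", "_source": {"content": "foo"}},
            {"_id": "2", "_source": {"content": "foo"}},
            {"_id": "3", "_source": {"content": "bar"}},
        ],
    },
    "aggregations": {
        "dedup": {
            "buckets": [
                {
                    "key": "foo",
                    "doc_count": 2,
                    "dedup_docs": {
                        "hits": {
                            "total": 2,
                            "hits": [{"_id": "1", "_source": {"content": "foo"}}],
                        }
                    },
                },
                {
                    "key": "bar",
                    "doc_count": 1,
                    "dedup_docs": {
                        "hits": {
                            "total": 1,
                            "hits": [{"_id": "3", "_source": {"content": "bar"}}],
                        }
                    },
                },
            ]
        }
    },
}

unique_docs = dedup_docs_from_aggs(sample_response)
print([doc["_id"] for doc in unique_docs])  # prints ['1', '3']
```

Note that a terms aggregation returns only the top 10 buckets by default, and the request's "from" and "size" page through hits, not through buckets, so this approach does not paginate the way a normal search does.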

1 Answer

You can also take a look at the recently added field collapsing feature (the "collapse" parameter of a search request).

alr
  • Collapsing fields seems to be a good idea, and I changed my query (please have a look at my post). But when I execute it, I get the error: "Unknown key for a START_OBJECT in [collapse]." – altralaser Jul 06 '17 at 11:30
  • field collapsing is only supported in ES 5.4 and later... you might need to upgrade – alr Jul 07 '17 at 07:04
  • I updated my Elasticsearch instance and it works perfectly. Thanks a lot! – altralaser Jul 07 '17 at 13:09
  • Field collapsing works well, but is it possible to get the total number of hits (the count without duplicates)? – altralaser Jul 11 '17 at 15:34
  • No, as mentioned on the docs page: "The total number of hits in the response indicates the number of matching documents without collapsing. The total number of distinct group is unknown." – alr Jul 12 '17 at 06:36
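Although the collapsed response itself does not report the number of distinct groups, one possible workaround, not mentioned in this thread, is to add a cardinality aggregation on the same field. It returns an approximate count of distinct values. The sketch below builds such a request body as a Python dict and reads the count from a response. The function names, the aggregation name "distinct_groups", the match_all query, and the sample response are all hypothetical, chosen only for illustration.

```python
import json


def build_collapsed_search(query, field, size=10, offset=0):
    """Build a search body that collapses on `field` and also asks for an
    approximate count of its distinct values."""
    return {
        "query": query,
        "size": size,
        "from": offset,
        # One hit per distinct value of `field`, as in the question's edit.
        "collapse": {"field": field},
        "aggs": {
            # "distinct_groups" is a made-up aggregation name.
            # Cardinality counts are approximate, not exact.
            "distinct_groups": {"cardinality": {"field": field}}
        },
    }


def distinct_group_count(response, agg_name="distinct_groups"):
    """Read the approximate distinct-value count from a search response."""
    return response["aggregations"][agg_name]["value"]


# match_all stands in for the query that is elided in the question.
body = build_collapsed_search({"match_all": {}}, "content.keyword")
print(json.dumps(body, indent=2))

# Hand-written sample response fragment (illustrative data, not real output).
sample_response = {"aggregations": {"distinct_groups": {"value": 2}}}
print(distinct_group_count(sample_response))  # prints 2
```

Because the cardinality value is an estimate, it is better suited to showing a rough result count than to exact pagination.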