Word-oriented completion suggester (ElasticSearch 5.x)

Question

ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:

Completion suggester is document-oriented

Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.

In short, completion queries now return every matching document instead of just the matched words. Herein lies the problem: autocompleted words are duplicated if they occur in more than one document.

Let's say we have this simple mapping:

{
   "my-index": {
      "mappings": {
         "users": {
            "properties": {
               "firstName": {
                  "type": "text"
               },
               "lastName": {
                  "type": "text"
               },
               "suggest": {
                  "type": "completion",
                  "analyzer": "simple"
               }
            }
         }
      }
   }
}

With a few test documents:

{
   "_index": "my-index",
   "_type": "users",
   "_id": "1",
   "_source": {
      "firstName": "John",
      "lastName": "Doe",
      "suggest": [
         {
            "input": [
               "John",
               "Doe"
            ]
         }
      ]
   }
},
{
   "_index": "my-index",
   "_type": "users",
   "_id": "2",
   "_source": {
      "firstName": "John",
      "lastName": "Smith",
      "suggest": [
         {
            "input": [
               "John",
               "Smith"
            ]
         }
      ]
   }
}

And a by-the-book query:

POST /my-index/_suggest?pretty
{
    "my-suggest" : {
        "text" : "joh",
        "completion" : {
            "field" : "suggest"
        }
    }
}

Which yields the following results:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "1",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Doe",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Doe"
                       ]
                    }
                 ]
               }
            },
            {
               "text": "John",
               "_index": "my-index",
               "_type": "users",
               "_id": "2",
               "_score": 1,
               "_source": {
                 "firstName": "John",
                 "lastName": "Smith",
                 "suggest": [
                    {
                       "input": [
                          "John",
                          "Smith"
                       ]
                    }
                 ]
               }
            }
         ]
      }
   ]
}

In short, the completion suggestion for the text "joh" returned two (2) documents, both of them Johns, and both with the same value for the text property.

However, I would like to receive one (1) word. Something simple like this:

{
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "my-suggest": [
      {
         "text": "joh",
         "offset": 0,
         "length": 3,
         "options": [
          "John"
         ]
      }
   ]
}
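The reduction from the actual response to the desired one can be sketched on the client side. This is only a minimal Python illustration, assuming the 5.x response has already been parsed into a dict shaped like the one above; `unique_suggestions` is a hypothetical helper name, not part of any Elasticsearch client.

```python
def unique_suggestions(response, suggest_name):
    """Collect distinct option texts for one named suggester, keeping first-seen order."""
    seen = set()
    unique = []
    for entry in response.get(suggest_name, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                unique.append(text)
    return unique


# Trimmed copy of the 5.x response shown above (the _source objects are omitted).
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

print(unique_suggestions(response, "my-suggest"))
```

Note that the suggester's size limit is applied before this step, so with many duplicates the resulting list can be shorter than the number of suggestions requested.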

Question: how can I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.

Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?

EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:

1. Keeping the new index in sync.
2. Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".

To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
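The difference between global and narrowed-down completion can be illustrated with a small Python sketch. It does not involve Elasticsearch at all; the two documents and both function names are hypothetical and only mirror the "John D" example, and the typed text is assumed to be non-empty.

```python
def complete_globally(words, typed):
    """Complete the last typed word against a flat word list, ignoring earlier words."""
    prefix = typed.lower().split()[-1]
    return [w for w in words if w.lower().startswith(prefix)]


def complete_per_document(docs, typed):
    """Complete the last typed word using only documents that contain all earlier words."""
    *earlier, prefix = typed.lower().split()
    result = []
    for doc in docs:
        lowered = [w.lower() for w in doc]
        if all(word in lowered for word in earlier):
            for w in doc:
                if w.lower().startswith(prefix) and w not in result:
                    result.append(w)
    return result


# Hypothetical data: a word-only index versus the documents the words came from.
words = ["John", "Doe", "David", "Smith"]
docs = [["John", "Doe"], ["David", "Smith"]]

print(complete_globally(words, "John D"))
print(complete_per_document(docs, "John D"))
```

The second function needs the word-to-document mapping to do its job, which is exactly the information a word-only index would throw away.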

As hinted at [in this issue](https://github.com/elastic/elasticsearch/issues/21676), this new behavior is "by design" and there's no plan to change it. Their suggestion is to create another index for the completion suggester. Pretty much as suggested by @EdgarVonk below. — Val, Jan 22 '17 at 05:28
What about a custom query on the current index? Maybe creating an additional NGram field for all the suggestions with a distinct query (with terms aggregation)? As for the additional suggestion-only index, I can identify a few issues, which actually contradict your proposed solution (see my updated question). — alesc, Jan 22 '17 at 08:51
Of course, a terms aggregation can also achieve a similar goal, but it depends on the load of documents you have. I'm not proposing that solution, Edgar and the ES folks (see issue) are ;-) — Val, Jan 22 '17 at 09:18

Val · Accepted Answer · 2019-06-28T13:31:09.167

As hinted at in the comment, another way of achieving this without getting duplicate documents is to add an autocomplete field with two sub-fields: completion, which contains edge n-grams of the whole value, and raw, which holds the exact value. First you define your mapping like this:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}

Then you index a few documents:

POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }

Then you can query for john d and get one result for John Doe and another one for John Deere:

POST my-index/_search
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}

Results:

{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
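To see why the un-analyzed term query for "john d" matches, it can help to emulate what completion_analyzer puts into the index. The following Python sketch only approximates the keyword tokenizer, lowercase filter and edge_ngram filter chain, and stands in for the terms aggregation with a counter; `edge_ngrams` and `suggest` are hypothetical helper names, not Elasticsearch APIs.

```python
from collections import Counter


def edge_ngrams(value, min_gram=1, max_gram=24):
    """Prefixes of the whole lowercased string (keyword tokenizer + lowercase + edge_ngram)."""
    token = value.lower()
    upper = min(len(token), max_gram)
    return {token[:n] for n in range(min_gram, upper + 1)}


def suggest(values, typed):
    """Emulate the term query on autocomplete.completion plus the terms aggregation on autocomplete.raw."""
    # A term query is not analyzed, so `typed` must equal an indexed prefix exactly.
    buckets = Counter(value for value in values if typed in edge_ngrams(value))
    return dict(buckets)


values = ["John Doe", "John Deere", "Johnny Cash"]

print(suggest(values, "john d"))    # only the two values whose prefix is "john d"
print(suggest(values, "johnny d"))  # no value starts with "johnny d"
print(suggest(values, "John D"))    # no match: the indexed prefixes are lowercase
```

Because the whole value is indexed as a single token, the earlier words constrain the later ones, and identical values fall into one bucket with a higher count instead of appearing twice.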

UPDATE (June 25th, 2019):

ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
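As a rough sketch of how this could look, the request bodies below are built as Python dicts and printed as JSON. The field name autocomplete is simply reused from the example above, and `build_mapping` and `build_query` are hypothetical helpers; the `_2gram`/`_3gram` sub-fields and the bool_prefix multi_match type are the ones described on the linked documentation page.

```python
import json


def build_mapping(field):
    """Index body declaring `field` as search_as_you_type (7.x typeless mapping)."""
    return {"mappings": {"properties": {field: {"type": "search_as_you_type"}}}}


def build_query(field, typed):
    """Search body: bool_prefix multi_match over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": typed,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


# Body for: PUT my-index
print(json.dumps(build_mapping("autocomplete"), indent=2))
# Body for: GET my-index/_search
print(json.dumps(build_query("autocomplete", "john d"), indent=2))
```

Such a query returns matching documents as ordinary search hits, so identical values coming from different documents would still need to be collapsed separately.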

edited Jun 28 '19 at 13:31

answered Jan 23 '17 at 06:01

Val


What if you wanted to search through more than one field? Ideally, I would have an additional multi-value field named `suggest`, which would contain all of the values that I would like to autocomplete (name, surname, username, email etc.). – alesc Jan 23 '17 at 08:40
That would be the same thing, each of the tokens contained in that field will be indexed – Val Jan 23 '17 at 08:42
But would aggregation also work so that it would remove duplicate entries? – alesc Jan 23 '17 at 10:00
In a terms aggregation, you'll only ever get a single occurrence of each matching term. There's no way you'd get two `John` buckets in the above aggregation. – Val Jan 23 '17 at 10:03
Your prototype works, but I will need to make a performance test to see if this solution is viable. Also, how do you propose to take subsequent words into account? Meaning that entering "johnny d" should **not** produce `Doe` and `Deere`, since no `johnny` has that surname? – alesc Jan 26 '17 at 18:07
Then we need another field with the concatenation of the two and run an edge n-gram on that field. – Val Jan 26 '17 at 18:10
With the current mapping, this obviously wouldn't work. What if we forget about first and last name and create a `my-autocomplete` field which has multiple entries (first name, last name etc.) and has the `raw` and `completion` subproperties like `firstName` in your example? Could you create a query in which all of the words must be contained among the ngrams in order to match, so that `c` in `johnny c` will only look for autocomplete matches where `johnny` also occurs? For simplicity, assume that you only wish to complete the last word (words are space separated). – alesc Jan 26 '17 at 18:23
Then we also need to account for the fact that the user might be typing the first name and then the last name, or the other way around. But if we decide that we need "first last", then yes, it's possible. It's not really different from what we have above; you just need to concatenate the two fields into another one. – Val Jan 26 '17 at 18:26
I've modified my example to add an `autocomplete` field that gives you the result you need. – Val Jan 27 '17 at 04:47
Do you think one could use `aggs` to somehow deduplicate results from the completion suggester? I know this isn't possible using the `_suggest` endpoint, but could this be achieved via the `_search` endpoint, where one can also query suggestions? [See first example](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html). – alesc Jan 28 '17 at 09:07

score 3 · Answer 2 · answered Sep 15 '17 at 06:39

An additional option, skip_duplicates, will be added to the completion suggester in the next 6.x release.

From the docs at https://www.elastic.co/guide/en/elasticsearch/reference/master/search-suggesters-completion.html#skip_duplicates:

POST music/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "nor",
            "completion" : {
                "field" : "suggest",
                "skip_duplicates": true
            }
        }
    }
}

answered Sep 15 '17 at 06:39

Dries Cleymans


Please note that `"skip_duplicates": true` works like a charm, but only for ES6.1 and it could be the best solution. For ES6.0, which is my case, it does not work. – sashaegorov Jan 20 '18 at 09:21

score 1 · Answer 3 · answered Jan 21 '17 at 20:18

We face exactly the same problem. In Elasticsearch 2.4, the approach you describe used to work fine for us, but now, as you say, the suggester has become document-based, while we, like you, are only interested in unique words, not in the documents.

The only 'solution' we could think of so far is to create a separate index just for the words on which we want to perform the suggestion queries and in this separate index make sure somehow that identical words are only indexed once. Then you could perform the suggestion queries on this separate index. This is far from ideal, if only because we will then need to make sure that this index remains in sync with the other index that we need for our other queries.

Could you elaborate how to create a mechanism to keep this index in sync? And how to avoid global suggestions for subsequent words? — alesc, Jan 22 '17 at 08:37
