I want to implement autocomplete with Elasticsearch, but I can't get it to work. I want something like this question here; I tried the suggested answers, but in vain. This is the behaviour I'm after:

My indexed strings are, for example:

  • "Developpeur Java"
  • "Developpeur C#"
  • "Je suis Developpeur"
  • "Je suis écrivan"
  • "Il est developpeur C++"

For input "develo", I want as output :

  • "Developpeur"
  • "Developpeur Java"
  • "Developpeur C#"
  • "Developpeur C++"

For input "developpeur", I want as output :

  • "developpeur Java"
  • "developpeur C#"
  • "developpeur C++"

For input "suis", I want as output :

  • "suis developpeur"
  • "suis écrivan"

I tried to achieve this using the completion suggester.

Here is the Elasticsearch version I'm using:

"number": "6.2.2",
"build_hash": "10b1edd",
"build_date": "2018-02-16T19:01:30.685723Z",
"build_snapshot": false,
"lucene_version": "7.2.1",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"

the mapping :

{
"settings": {
    "number_of_shards": "1",
    "analysis": {
        "filter": {
            "prefix_filter": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 20
            },
            "ngram_filter": {
                "type": "nGram",
                "min_gram": "3",
                "max_gram": "3"
            },
            "synonym_filter": {
                "type": "synonym",
                "synonyms": [
                    "hackwillbereplacedatindexcreation,hackwillbereplacedatindexcreation"
                ]
            },
            "french_stop": {
                "type": "stop",
                "stopwords": "french"
            }
        },
        "analyzer": {
            "word": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "french_stop"
                ],
                "char_filter": []
            },
            "prefix": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "synonym_filter",
                    "prefix_filter"
                ],
                "char_filter": []
            },
            "ngram_with_synonyms": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "synonym_filter",
                    "ngram_filter"
                ],
                "char_filter": []
            },
            "ngram": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "ngram_filter"
                ],
                "char_filter": []
            }
        }
    }
},
"mappings": {
    "training": {
        "properties": {
            "id": {
                "type": "text",
                "index": false
            },
            "label": {
                "type": "text",
                "index_options": "docs",
                "copy_to": "full_label",
                "analyzer": "word",
                "fields": {
                    "prefix": {
                        "type": "text",
                        "index_options": "docs",
                        "analyzer": "prefix",
                        "search_analyzer": "word"
                    },
                    "ngram": {
                        "type": "text",
                        "index_options": "docs",
                        "analyzer": "ngram_with_synonyms",
                        "search_analyzer": "ngram"
                    }
                }
            },
            "labelSuggest": {
                "type": "completion",
                "analyzer": "word"
            }
        }
    }
}
}

Then, when I populate the index with my data, I do this (this is the body of the PUT call made to the ES API; I'm using Python for this):

body = {
    "label": r["title"],
    "labelSuggest": {
        "input": r["title"].ngrams()
    },
    "weight": 1.
}

r["title"].ngrams() returns all the word n-grams of the title. For example, "Development research biotech" gives: "Development", "research", "biotech", "Development research", "research biotech" and "Development research biotech".
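The ngrams() helper itself is not shown in the question (a Python str has no such method, so it must be custom). A minimal sketch of what such a helper might look like, assuming it splits on whitespace and returns every contiguous run of words; the name word_ngrams is hypothetical:

```python
def word_ngrams(title):
    """Return every contiguous run of words in `title`, shortest runs first."""
    words = title.split()
    result = []
    # size = number of words in the n-gram, start = index of its first word
    for size in range(1, len(words) + 1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


print(word_ngrams("Development research biotech"))
```

For the example title this yields the six inputs listed above, in the same order.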

Then, to call the suggester, I do:

   POST  http://localhost:9200/training/_search?pretty
{
    "suggest": {
        "labelSuggest": {
            "text": "developpeur",
            "completion": {
                "field": "labelSuggest",
                "skip_duplicates": true

            }
        }
    }
}

The result is :

{
    "text": "développement",
    "_index": "activity_20180518092449",
    "_type": "activity",
    "_id": "2031ce8b-6589-3270-afdf-7901aa21efa1",
    "_score": 1,
    "_source": {
        "id": "2031ce8b-6589-3270-afdf-7901aa21efa1",
        "name": "development research biotech",
        "labelSuggest": [
            "development",
            "research",
            "biotech",
            "development research",
            "research biotech",
            "development research biotech"
        ]
    }
}

But I want something that gives me "development", "development research" and "development research biotech" (supposing that document is the only one indexed).

What is wrong with my mapping or query? Is this the right way to do it? I hope my question is clear. I have searched a lot without success.

Thanks in advance

Hamid Cherif

1 Answer

First of all, the nGram filter won't do what you describe.

This:

"ngram_filter": {
            "type": "nGram",
            "min_gram": "3",
            "max_gram": "3"
        },

turns "developpeur Java" into three-character fragments: dev, eve, vel, elo, and so on.

Check the documentation here: Ngram Tokenizer
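To see why a 3/3 nGram filter produces those fragments, here is a rough plain-Python simulation. It is a sketch, not Elasticsearch's implementation: lowercasing and splitting on whitespace only approximate the standard tokenizer plus lowercase filter, and char_ngrams is a hypothetical helper name.

```python
def char_ngrams(token, min_gram, max_gram):
    """Return all substrings of `token` with length between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


tokens = "developpeur Java".lower().split()
trigrams = [gram for token in tokens for gram in char_ngrams(token, 3, 3)]
print(trigrams)
```

Every fragment is exactly three characters long, so a longer prefix such as "develo" never appears as a token on its own.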

Second, for the result you want, I would use a single custom analyzer with a whitespace tokenizer and the "icu_folding" and "edge_ngram" filters. I would set the edge n-gram filter's min_gram to 2 and its max_gram to 20-25.

This generates tokens like these from "developpeur Java": de, dev, deve, devel, develo, develop, developp, developpe, developpeu, developpeur, and so on.
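The token list above can be reproduced with a small plain-Python sketch. It only imitates the edge n-gram idea (prefixes of each whitespace-separated word); it is not the Elasticsearch filter, and edge_ngrams is a hypothetical helper name.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, upper + 1)]


for token in "developpeur Java".lower().split():
    print(token, "->", edge_ngrams(token))
```

Unlike the plain n-gram case, every token here starts at the beginning of a word, which is what prefix-style autocomplete needs.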

Then do a simple term search on that field. If the autocomplete feeds a dropdown, this returns records as you type. I hope I understood your problem and that this helps.

UPDATE: Try using this:

"suggester": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": ["my_ngram_filter", "icu_folding"],
    "char_filter": []
}

where "my_ngram_filter" is:

"my_ngram_filter": {
    "type": "edge_ngram",
    "min_gram": "2",
    "max_gram": "20"
}

Then the mapping on the field should look like this:

"labelSuggest": {
    "type": "text",
    "analyzer": "suggester"
}
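For reference, the fragments above could be assembled into one index-creation body roughly as follows. This is a sketch under assumptions: the "training" type name is taken from the question's mapping (Elasticsearch 6.x), and the icu_folding filter requires the analysis-icu plugin to be installed. The dict is only built and printed here, not sent to a cluster.

```python
import json

# Index-creation body combining the analyzer, the filter and the field mapping.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "suggester": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["my_ngram_filter", "icu_folding"],
                }
            },
        }
    },
    "mappings": {
        "training": {  # type name assumed from the question's mapping
            "properties": {
                "labelSuggest": {"type": "text", "analyzer": "suggester"}
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```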

Then do a simple search:

{
    "query": {
        "term": {
            "labelSuggest": "dev"
        }
    }
}
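To illustrate what such a term search matches, here is a plain-Python simulation over the example labels from the question. It is a sketch, not Elasticsearch: fold() only approximates icu_folding (lowercasing and stripping accents), folding is applied before the prefixes are generated, and all helper names are hypothetical.

```python
import unicodedata


def fold(text):
    """Lowercase and strip accents, a rough stand-in for icu_folding."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_tokens(label, min_gram=2, max_gram=20):
    """Edge n-grams of every whitespace-separated word in the folded label."""
    tokens = set()
    for word in fold(label).split():
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.add(word[:size])
    return tokens


labels = [
    "Developpeur Java",
    "Developpeur C#",
    "Je suis Developpeur",
    "Je suis écrivan",
    "Il est developpeur C++",
]


def term_search(term):
    """Return the labels whose indexed tokens contain `term` exactly (no analysis)."""
    return [label for label in labels if term in index_tokens(label)]


print(term_search("develo"))
print(term_search("Develo"))
```

Two things are visible in this simulation: the search term is compared as-is, so "Develo" with a capital letter matches nothing, and a match returns whole labels (documents), not the bare word "Developpeur".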
danvasiloiu
  • The ngrams() function I'm using is a custom function I defined in Python. I will try the filters and tokenizer you talked about. Thanks for your reply – Hamid Cherif May 23 '18 at 11:13
  • I created this analyzer: `"suggester": { "type": "custom", "tokenizer": "whitespace", "filter": [ "my_ngram_filter", "asciifolding" ], "char_filter": [] }` where "my_ngram_filter" is: `"my_ngram_filter": { "type": "nGram", "min_gram": "2", "max_gram": "20" },` This still doesn't work. – Hamid Cherif May 23 '18 at 13:17
  • man... use edge_ngram. not ngram. i told you what the ngram does. also put the search part. – danvasiloiu May 24 '18 at 08:56
  • Yes, I tried that filter and it doesn't work as I expected. I tried two queries: `{ "query":{ "match": { "labelSuggest":"dev" } } }` and `{ "suggest": { "labelSuggest" : { "prefix" : "dev", "completion" : { "field" : "labelSuggest", "skip_duplicates": true } } } }` Neither works as I expect. – Hamid Cherif May 24 '18 at 09:15
  • I expect those queries to give me the following results : - "Developpeur" - "Developpeur Java" - "Developpeur C#" - "Developpeur C++" But they give me : - "Developpeur Java" - "Developpeur C#" - "Developpeur C++" – Hamid Cherif May 24 '18 at 09:17
  • Thanks for the more detailed answer. I just tried that but it gives me back the whole label as result : It gave me "Developpeur Java" - "Developpeur C#" - "Developpeur C++" It didn't give me "Developpeur" then the others as I expected – Hamid Cherif May 24 '18 at 11:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/171693/discussion-between-hamid-and-danvasiloiu). – Hamid Cherif May 24 '18 at 12:01