
I've recently started using ElasticSearch and I can't seem to make it search for a part of a word.

Example: I have three documents from my CouchDB indexed in ElasticSearch:

{
  "_id" : "1",
  "name" : "John Doeman",
  "function" : "Janitor"
}
{
  "_id" : "2",
  "name" : "Jane Doewoman",
  "function" : "Teacher"
}
{
  "_id" : "3",
  "name" : "Jimmy Jackal",
  "function" : "Student"
} 

So now, I want to search for all documents containing "Doe"

curl http://localhost:9200/my_idx/my_type/_search?q=Doe

That doesn't return any hits. But if I search for

curl http://localhost:9200/my_idx/my_type/_search?q=Doeman

It does return one document (John Doeman).

I've tried setting different analyzers and different filters as properties of my index. I've also tried using a full-blown query (for example:

{
  "query": {
    "term": {
      "name": "Doe"
    }
  }
}

), but nothing seems to work.

How can I make ElasticSearch find both John Doeman and Jane Doewoman when I search for "Doe" ?

UPDATE

I tried to use the nGram tokenizer and filter, as Igor proposed, like this:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "bulk_size": "100",
    "bulk_timeout": "10ms",
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": [
            "my_ngram_filter"
          ]
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 1
        }
      }
    }
  }
}

The problem I'm having now is that each and every query returns ALL documents. Any pointers? The ElasticSearch documentation on nGram isn't great...
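A quick way to see what a 1-gram setup does is to generate the grams outside ElasticSearch. The sketch below is plain Python mimicking an nGram filter's behaviour (the `ngrams` helper is made up for illustration, not an ES API): with min_gram and max_gram both set to 1, every name is indexed as individual letters, so almost any query shares a letter with almost every document.

```python
def ngrams(text, min_gram, max_gram):
    """Rough imitation of what an nGram filter emits for one token."""
    text = text.lower()
    return {text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)}

# With min_gram = max_gram = 1, every document becomes a bag of letters:
print(ngrams("Doeman", 1, 1))  # {'d', 'o', 'e', 'm', 'a', 'n'}

# A query is shredded the same way, so it overlaps with nearly everything:
docs = ["John Doeman", "Jane Doewoman", "Jimmy Jackal"]
query_grams = ngrams("Jane", 1, 1)
hits = [d for d in docs if query_grams & ngrams(d, 1, 1)]
print(hits)  # all three documents share at least one letter with "Jane"
```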

ldx

11 Answers


I'm using nGram, too. I use the standard tokenizer and nGram just as a filter. Here is my setup:

{
  "index": {
    "index": "my_idx",
    "type": "my_type",
    "analysis": {
      "index_analyzer": {
        "my_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "mynGram"
          ]
        }
      },
      "search_analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "mynGram"
          ]
        }
      },
      "filter": {
        "mynGram": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 50
        }
      }
    }
  }
}

This lets you find word parts up to 50 letters long. Adjust max_gram as you need. In German, words can get really long, so I set it to a high value.
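To see what this filter chain actually indexes, the grams can be reproduced in plain Python (a sketch of the nGram filter's output, not ES code; the helper below is hypothetical):

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of length min_gram..max_gram, roughly as an nGram filter emits them."""
    token = token.lower()  # the lowercase filter runs before mynGram above
    top = min(max_gram, len(token))
    return {token[i:i + n]
            for n in range(min_gram, top + 1)
            for i in range(len(token) - n + 1)}

grams = ngrams("Doeman", 2, 50)
print(sorted(grams))  # 'an', 'do', 'doe', 'doem', ..., 'doeman'

# "doe" is now one of the indexed terms, so a search for "Doe" matches:
print("doe" in grams)  # True
```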

roka
  • [n-grams can waste memory if you're not careful; the min_gram and max_gram analyzer settings should be enough to narrow searches down to one record, and no more (a max_gram of 15 over a name is probably wasteful, since very few names share a substring that long).](http://blog.bignerdranch.com/1640-getting-fancy-with-elasticsearch/) – rthbound Dec 18 '13 at 23:17
  • Is that what you get from the settings of the index or is that what you post to elasticsearch to configure it? – Tomas Jansson Jan 29 '14 at 14:31
  • It's a POST to configure Elasticsearch. – roka Jan 31 '14 at 09:46
  • I’m not firm with current versions of Elasticsearch, but they should mention it in the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html – roka Feb 13 '16 at 11:00
  • Does this solution work for languages like Chinese, Korean, etc.? – Nidhin David Mar 02 '16 at 13:57
  • As it is just some characters, it should work the same. You should adjust the `max_gram` to your language to improve the performance. I’m not firm with these languages though. – roka Mar 06 '16 at 12:07
  • @roka I followed your answer strictly, as far as I can see, and it isn't working. Kindly, if you have the chance, see https://stackoverflow.com/questions/61689741/desire-feature-of-searching-for-part-of-word-in-elasticsearch-returning-nothing – Jim C May 08 '20 at 23:42
  • @JimC I haven’t used ElasticSearch for at least 7 years, so I don’t know the current changes of the project. – roka May 11 '20 at 04:45

I think there's no need to change any mapping. Try query_string; it works well. All of the scenarios below work with the default standard analyzer.

We have data:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 1:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Doe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

Scenario 2:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*Jan*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}

Scenario 3:

{"query": {
    "query_string" : {"default_field" : "name", "query" : "*oh* *oe*"}
} }

Response:

{"_id" : "1","name" : "John Doeman","function" : "Janitor"}
{"_id" : "2","name" : "Jane Doewoman","function" : "Teacher"}

EDIT: the same implementation with Spring Data Elasticsearch: https://stackoverflow.com/a/43579948/2357869

One more explanation of how query_string compares to other queries: https://stackoverflow.com/a/43321606/2357869

Vijay
  • i think this is the easiest – Esgi Dendyanri May 26 '17 at 11:59
  • Yes, I have implemented it in my project. – Vijay May 26 '17 at 13:24
  • How to include multiple fields to search in? – Shubham A. Jun 02 '17 at 10:57
  • try this: { "query": { "query_string" : { "fields" : ["content", "name"], "query" : "this AND that" } } } – Vijay Jun 02 '17 at 10:58
  • check this link https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html – Vijay Jun 02 '17 at 10:59
  • @VijayGupta I tried this. Problem is that if `content` contains `this` and `name` contains `AND`, result is still returned. The `this AND that` should only match as a substring completely, but it seems it is divided into tokens here. – Shubham A. Jun 02 '17 at 11:39
  • @VijayGupta Like with your sample data, if you search for `jo ja`, both of the results will be returned. How to solve this? – Shubham A. Jun 02 '17 at 11:46
  • Yes, it will divide into tokens. You might be looking for this question https://stackoverflow.com/q/43913595/2357869, or try using a match_phrase query. – Vijay Jun 02 '17 at 11:51
  • In terms of processing speed, how does this scale up? – Litwos May 21 '18 at 09:43
  • @VijayGupta what about search performance if I have large data in my es and I am using a query like "*Doe*" then can it search performance? – Suraj Dalvi Oct 31 '18 at 06:05
  • @Suraj Dalvi, we are using a native function of ES; I don't think we will get any issue regarding performance. I have not faced any issue. – Vijay Oct 31 '18 at 12:35
  • @Litwos, processing speed is good; I haven't had any issue so far. – Vijay Oct 31 '18 at 13:12
  • How come scenario 2 is giving John Doeman instead of Jane Doewoman? – gaurav kumar Sep 06 '20 at 19:42
  • This worked for me. Thanks – DJ Burb Oct 19 '21 at 15:49
  • Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false. Taken from: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-query-string-query.html – mfaisalhyder Aug 03 '23 at 15:15

Searching with leading and trailing wildcards is going to be extremely slow on a large index. If you want to be able to search by word prefix, remove the leading wildcard. If you really need to find a substring in the middle of a word, you would be better off using the ngram tokenizer.
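The difference can be sketched outside Elasticsearch. A term dictionary is stored sorted, so a prefix query can jump straight to the matching range, while a leading wildcard has to examine every term. The Python below is an illustration of that idea under those assumptions, not Lucene internals:

```python
import bisect

# A toy "term dictionary": sorted, like the terms Lucene keeps per field.
terms = sorted(["doeman", "doewoman", "jackal", "jane", "janitor",
                "jimmy", "john", "student", "teacher"])

def prefix_search(prefix):
    """Binary-search the sorted dictionary -- cheap, like a prefix query."""
    lo = bisect.bisect_left(terms, prefix)
    hi = bisect.bisect_left(terms, prefix + "\uffff")
    return terms[lo:hi]

def substring_search(sub):
    """Leading-wildcard style: every term must be examined, one by one."""
    return [t for t in terms if sub in t]

print(prefix_search("doe"))     # ['doeman', 'doewoman'] -- touches a narrow range
print(substring_search("man"))  # ['doeman', 'doewoman'] -- touches all terms
```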

imotov
  • Igor is right. At least remove the leading *. For an nGram ElasticSearch example, see this gist: https://gist.github.com/988923 – karmi Jun 24 '11 at 19:08
  • @karmi: Thanks for your complete example! Perhaps you want to add your comment as an actual answer; it's what got it working for me and what I would want to upvote. – Fabian Steeg Nov 12 '12 at 15:46

Without changing your index mappings, you could do a simple prefix query that will do the partial (prefix) searches you are hoping for, e.g.:

{
  "query": { 
    "prefix" : { "name" : "Doe" }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html


There are a lot of answers here which focus on solving the issue at hand but don't talk much about the various trade-offs you need to weigh before choosing a particular answer, so let me try to add a few more details from this perspective.

Partial search is nowadays a very common and important feature, and if not implemented properly it can lead to poor user experience and bad performance, so first understand your application's functional and non-functional requirements related to this feature, which I talked about in this detailed SO answer.

There are various approaches: query-time, index-time, the completion suggester, and the search_as_you_type data type added in recent versions of Elasticsearch.

People who want to quickly implement a solution can use the end-to-end working example below.

Index mapping

{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    },
    "index.max_ngram_diff" : 10
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete", 
        "search_analyzer": "standard" 
      }
    }
  }
}

Index given sample docs

{ "title" : "John Doeman" }

{ "title" : "Jane Doewoman" }

{ "title" : "Jimmy Jackal" }

And search query

{
    "query": {
        "match": {
            "title": "Doe"
        }
    }
}

which returns the expected search results:

 "hits": [
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.76718915,
                "_source": {
                    "title": "John Doeman"
                }
            },
            {
                "_index": "6467067",
                "_type": "_doc",
                "_id": "2",
                "_score": 0.76718915,
                "_source": {
                    "title": "Jane Doewoman"
                }
            }
        ]
Amit
  • In the mapping, `"search_analyzer": "standard"` the "standard" search analyzer is important. I was using a "lowercase" filter, and the values I was searching had digits in them. Digits are ignored by "lowercase" filter. – JessieinAg Nov 07 '22 at 15:51

Try the solution that is described here: Exact Substring Searches in ElasticSearch

{
    "mappings": {
        "my_type": {
            "index_analyzer":"index_ngram",
            "search_analyzer":"search_ngram"
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },
                "search_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                }
            }
        }
    }
}

To solve the disk-usage problem and the too-long-search-term problem, short ngrams of at most 8 characters are used (configured with: "max_gram": 8). To search for terms with more than 8 characters, turn your search into a boolean AND query looking for every distinct 8-character substring in that string. For example, if a user searched for large yard (a 10-character string), the search would be:

"large ya" AND "arge yar" AND "rge yard"
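Assuming the max_gram of 8 from the settings above, that windowing step can be sketched in Python (the helper name is made up for illustration):

```python
def window_terms(query, max_gram=8):
    """Break a query longer than max_gram into its sliding max_gram-character windows."""
    if len(query) <= max_gram:
        return [query]
    return [query[i:i + max_gram] for i in range(len(query) - max_gram + 1)]

windows = window_terms("large yard")
print(windows)  # ['large ya', 'arge yar', 'rge yard']

# Joined into the boolean AND query the answer describes:
print(" AND ".join('"%s"' % w for w in windows))
# "large ya" AND "arge yar" AND "rge yard"
```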

uı6ʎɹnɯ ꞁəıuɐp
  • dead link, pls fix – DarkMukke Sep 12 '17 at 11:45
  • I have been looking for something like this for a while. Thank you! Do you know how the memory scales with the `min_gram` and `max_gram` it seems like it would be linearly dependent on the size of the field values and the range of `min` and `max`. How frowned upon is using something like this? – Glen Thompson Oct 26 '19 at 16:34
  • Also is there any reason that the `ngram` is a filter over a tokenizer? could you not just have it as a tokenizer and then apply a lowercase filter... `index_ngram: { type: "custom", tokenizer: "ngram_tokenizer", filter: [ "lowercase" ] }` I tried it and it seems to give the same results using the analyzer test api – Glen Thompson Oct 26 '19 at 17:49
  • Used wayback machine https://web.archive.org/web/20131216221809/http://blog.rnf.me/2013/exact-substring-search-in-elasticsearch.html – Pants Oct 03 '21 at 15:21

I am using this and it worked for me:

"query": {
        "query_string" : {
            "query" : "*test*",
            "fields" : ["field1","field2"],
            "analyze_wildcard" : true,
            "allow_leading_wildcard": true
        }
    }
saravanavelu
  • Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match. Leading wildcards can be disabled by setting allow_leading_wildcard to false. Taken from: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/query-dsl-query-string-query.html – mfaisalhyder Aug 03 '23 at 15:12

If you want to implement autocomplete functionality, then the Completion Suggester is the neatest solution. This blog post contains a very clear description of how it works.

In short, it's an in-memory data structure called an FST (finite state transducer) which contains the valid suggestions and is optimised for fast retrieval and low memory usage. Essentially, it is just a graph. For instance, an FST containing the words hotel, marriot, mercure, munchen and munich would look like this:

(image: FST graph of the example words)
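A full FST is beyond a short snippet, but the prefix-sharing idea can be sketched with a plain trie (illustrative only; a real FST also shares suffixes and stores outputs on the arcs):

```python
def build_trie(words):
    """Store words in a dict-of-dicts trie so shared prefixes are kept once."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def suggest(trie, prefix):
    """Walk the graph along the prefix, then collect every reachable completion."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [(node, prefix)]
    while stack:
        n, acc = stack.pop()
        for key, child in n.items():
            if key == "$":
                out.append(acc)
            else:
                stack.append((child, acc + key))
    return sorted(out)

trie = build_trie(["hotel", "marriot", "mercure", "munchen", "munich"])
print(suggest(trie, "mu"))  # ['munchen', 'munich']
print(suggest(trie, "m"))   # ['marriot', 'mercure', 'munchen', 'munich']
```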

Neshta

You can use regexp.

{ "_id" : "1", "name" : "John Doeman" , "function" : "Janitor"}
{ "_id" : "2", "name" : "Jane Doewoman","function" : "Teacher"  }
{ "_id" : "3", "name" : "Jimmy Jackal" ,"function" : "Student"  } 

If you use this query:

{
  "query": {
    "regexp": {
      "name": "J.*"
    }
  }
}

you will get all documents whose name starts with "J". If you want to receive just the first two records, whose names end with "man", you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*man"
    }
  }
}

And if you want to receive all records whose name contains "m", you can use this query:

{
  "query": { 
    "regexp": {
      "name": ".*m.*"
    }
  }
}

This works for me, and I hope my answer helps solve your problem.
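The matching behaviour of those three patterns can be checked locally with ordinary regular expressions (a sketch under simplifying assumptions: Elasticsearch regexp queries are anchored to whole analyzed terms, which `re.fullmatch` over lowercased words approximates here):

```python
import re

names = ["John Doeman", "Jane Doewoman", "Jimmy Jackal"]

def regexp_query(pattern, field_values):
    """Approximate an ES regexp query: fullmatch against each lowercased token."""
    return [v for v in field_values
            if any(re.fullmatch(pattern, tok) for tok in v.lower().split())]

print(regexp_query("j.*", names))    # all three names start with "J"
print(regexp_query(".*man", names))  # ['John Doeman', 'Jane Doewoman']
print(regexp_query(".*m.*", names))  # every name contains an "m"
```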


Using wildcards (*) prevents the calculation of a score.

Dardino
  • Could you add more details to your answer? Provide sample code or a reference to documentation on what this does. – Cray Jul 01 '19 at 16:07

Nevermind.

I had to look at the Lucene documentation. Seems I can use wildcards! :-)

curl http://localhost:9200/my_idx/my_type/_search?q=*Doe*

does the trick!

ldx
  • See @imotov's answer. The use of wildcards is not going to scale well at all. – Mike Munroe Jun 05 '12 at 11:19
  • @Idx - See how your own answer is downvoted. Downvotes represent the quality and relevancy of an answer. Could you spare a minute to accept the right answer? At least new users would be grateful to you. – asyncwait Dec 26 '13 at 14:43
  • Enough downvotes. OP made clear what the best answer is now. +1 for sharing what seemed to be the best answer before someone posted a better one. – s.Daniel Mar 17 '15 at 10:05