In Elasticsearch, how do I search for an arbitrary substring, perhaps including spaces? (Searching for part of a word isn't quite enough; I want to search any substring of an entire field.)

I imagine it has to be in a keyword field, rather than a text field.

Suppose I have only a few thousand documents in my Elasticsearch index, and I try:

  "query": {
         "wildcard" : { "description" : "*plan*" }
  }

That works as expected--I get every item where "plan" is in the description, even ones like "supplantation".

Now, I'd like to do

  "query": {
         "wildcard" : { "description" : "*plan is*" }
  }   

...so that I might match documents with "Kaplan isn't" among many other possibilities.

It seems this isn't possible with wildcard, match prefix, or any other query type I might see. How do I simply search on any substring? (In SQL, I would just do description LIKE '%plan is%')

(I am aware any such query would be slow or perhaps even impossible for large data sets.)

Patrick Szalapski
  • You need to tokenize your description, in order to search for separate words. Have a read in their documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html – cheffe Jun 28 '17 at 07:45
  • If you really want to search for an arbitrary substring, you need to go for ngrams: https://www.elastic.co/guide/en/elasticsearch/guide/current/_ngrams_for_partial_matching.html – cheffe Jun 28 '17 at 07:46
  • Possible duplicate of [How to search for a part of a word with ElasticSearch](https://stackoverflow.com/questions/6467067/how-to-search-for-a-part-of-a-word-with-elasticsearch) – cheffe Jun 28 '17 at 07:48

2 Answers

Have you tried the regexp query in Elasticsearch? It sounds like what you are looking for.

Jai Sharma

I was hoping there might be something built in to Elasticsearch for this, given that simple substring search seems like a very basic capability (it is implemented as strstr() in C, LIKE '%%' in SQL, Ctrl+F in most text editors, String.IndexOf in C#, etc.), but this seems not to be the case. Note that the regexp query doesn't support case insensitivity, so I also needed to pair it with this custom analyzer, so that the index stores each value as a single all-lowercase term. Then I can convert my search string to lowercase as well.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": { 
          "type": "custom",
          "tokenizer": "keyword", 
          "filter": [ "lowercase" ] 
        }
      }
    }
  },
  "mappings": { 
     ...
     "description": {"type": "text", "analyzer": "lowercase_keyword"}
  }
}

Example query:

  "query": {
         "regexp" : { "description" : ".*plan is.*" }
  }
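
If the search string comes from user input, it also needs its regexp operators escaped before it is wrapped in `.*`. Here is a minimal Python sketch of that step. The helper name `build_substring_query` is hypothetical, and the set of reserved characters is my assumption based on the Lucene regexp syntax, so check it against the regexp query documentation for your version. Python's `lower()` may also differ from the `lowercase` token filter for some non-ASCII characters.

```python
# Characters assumed to be operators in Lucene regexp syntax
# (an assumption; verify against the docs for your Elasticsearch version).
_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def build_substring_query(field, text):
    """Hypothetical helper: regexp query body matching `text` anywhere in `field`."""
    # The index holds all-lowercase values, so lowercase the input too.
    lowered = text.lower()
    # Backslash-escape every reserved character so it is matched literally.
    escaped = "".join("\\" + ch if ch in _RESERVED else ch for ch in lowered)
    return {"query": {"regexp": {field: ".*" + escaped + ".*"}}}


if __name__ == "__main__":
    import json
    print(json.dumps(build_substring_query("description", "Plan is")))
```

The returned dictionary can be serialized to JSON and sent as the search request body.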

Thanks to Jai Sharma for leading me; I just wanted to provide more detail.

Patrick Szalapski
  • This is correct, but it does not work with a field longer than 32766 bytes. Original message: "bytes can be at most 32766 in length; got 32804", "caused_by": {"type": "max_bytes_length_exceeded_exception", "reason": "bytes can be at most 32766 in length; got 32804"}. Any workarounds? –  Jul 25 '17 at 13:38
  • Got it, so, keep any value under 32K. I assume it is UTF-8 by default? – Patrick Szalapski Jul 25 '17 at 17:40
  • I can't... there are a lot of workarounds, but not a solution. –  Aug 09 '17 at 22:58
  • I still do not see a way to do a simple substring search except for via regexp on a keyword field with values under 32K. I agree, it is limiting. – Patrick Szalapski Sep 22 '17 at 18:58
  • I have filed an Elasticsearch issue here: https://github.com/elastic/elasticsearch/issues/26759 – Patrick Szalapski Sep 22 '17 at 19:20
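
On the 32766 limit discussed in the comments: since the keyword tokenizer emits the whole field value as one term, a value can be checked before indexing. This is a rough Python sketch, assuming the limit applies to the UTF-8 encoding of the value, as the "bytes" wording of the error suggests. The function name is hypothetical.

```python
MAX_TERM_BYTES = 32766  # limit quoted in the error message in the comments


def fits_in_single_term(value):
    """Hypothetical check: True if `value` is within the limit when UTF-8 encoded."""
    # Count encoded bytes, not characters: non-ASCII characters take 2-4 bytes each.
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


if __name__ == "__main__":
    print(fits_in_single_term("Kaplan isn't"))
```

Values that fail the check would need to be truncated or indexed some other way; this only detects the problem, it does not work around it.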