
I used a match_phrase query for full-text matching, but it did not work as I expected.

Query:

POST /_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "browsing_url": "/critical-illness"
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

Results:

"hits" : [
      {
        "_source" : {
          "browsing_url" : "https://www.google.com/url?q=https://industrytoday.co.uk/market-research-industry-today/global-critical-illness-commercial-insurance-market-to-witness-a-pronounce-growth-during-2020-2025&usg=afqjcneelu0qvjfusnfjjte1wx0gorqv5q"
        }
      },
      {
        "_source" : {
          "browsing_url" : "https://www.google.com/search?q=critical+illness"
        }
      },
      {
        "_source" : {
          "browsing_url" : "https://www.google.com/search?q=critical+illness&tbm=nws"
        }
      },
      {
        "_source" : {
          "browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness+-insurance%3f"
        }
      },
      {
        "_source" : {
          "browsing_url" : "https://www.google.com/search?q=do+i+have+a+critical+illness%3f"
        }
      }
    ]

Expectation:

I only want results where the given string appears as an exact substring of the field value. For example:

https://www.example.com/critical-illness OR
https://www.example.com/critical-illness-insurance

Mapping:

"browsing_url": {
  "type": "text",
  "norms": false,
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

The results are not what I expected: I want only documents where /critical-illness appears as an exact substring of the stored URL.

Ankit

1 Answer


You're seeing unexpected results because both your search query and the field itself are run through an analyzer. An analyzer breaks text down into a list of individual terms that can be searched on. Here's an example using the _analyze endpoint:

GET _analyze
{
  "analyzer": "standard",
  "text": "example.com/critical-illness"
}

{
  "tokens" : [
    {
      "token" : "example.com",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "critical",
      "start_offset" : 12,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "illness",
      "start_offset" : 21,
      "end_offset" : 28,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

So while your document's true value is example.com/critical-illness, behind the scenes Elasticsearch only uses this list of tokens for matching. The same goes for your search query, since match_phrase tokenizes the phrase passed in. The end result is Elasticsearch trying to match the token list ["critical", "illness"] against your documents' token lists.
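You can see this by running your search phrase itself through the standard analyzer:

```json
GET _analyze
{
  "analyzer": "standard",
  "text": "/critical-illness"
}
```

This yields only the tokens critical and illness; the leading / and the hyphen are discarded as separators, which is why your query happily matches any document containing both words.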

Most of the time the standard analyzer does a good job of removing unnecessary tokens; however, in your case you care about characters like /, since you want to match against them. One way to solve this is to use a different analyzer, such as a reversed path hierarchy analyzer. Below is an example of how to configure this analyzer and use it for your browsing_url field:

PUT /browse_history
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_analyzer": {
          "tokenizer": "url_tokenizer"
        }
      },
      "tokenizer": {
        "url_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": true
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "browsing_url": {
        "type": "text",
        "norms": false,
        "analyzer": "url_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
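With the index created, you can load a few documents to test against. The URLs below are sample values chosen for illustration:

```json
POST /browse_history/_bulk
{ "index": {} }
{ "browsing_url": "https://www.example.com/critical-illness" }
{ "index": {} }
{ "browsing_url": "https://www.google.com/search?q=critical+illness" }
{ "index": {} }
{ "browsing_url": "https://www.example.com/health/insurance" }
```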

Now if you analyze a URL, you'll see the URL paths kept whole:

GET browse_history/_analyze
{
  "analyzer": "url_analyzer",
  "text": "example.com/critical-illness?src=blah"
}

{
  "tokens" : [
    {
      "token" : "example.com/critical-illness?src=blah",
      "start_offset" : 0,
      "end_offset" : 37,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "critical-illness?src=blah",
      "start_offset" : 12,
      "end_offset" : 37,
      "type" : "word",
      "position" : 0
    }
  ]
}

This lets you use match_phrase_prefix to find all documents whose URLs contain a critical-illness path:

POST /browse_history/_search
{
  "query": {
    "match_phrase_prefix": {
      "browsing_url": "critical-illness"
    }
  }
}

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.7896894,
    "hits" : [
      {
        "_index" : "browse_history",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7896894,
        "_source" : {
          "browsing_url" : "https://www.example.com/critical-illness"
        }
      }
    ]
  }
}
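Note that because the analyzer is part of the mapping, documents indexed before this change won't pick it up; they need to be re-indexed into the new index. A _reindex call can copy them over (the source index name here is an assumption, substitute your own):

```json
POST _reindex
{
  "source": { "index": "browse_history_old" },
  "dest": { "index": "browse_history" }
}
```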

EDIT:

The previous version of this answer used the keyword field with a regexp query, but that is a fairly costly query to run:

POST /browse_history/_search
{
  "query": {
    "regexp": {
      "browsing_url.keyword": ".*/critical-illness"
    }
  }
}
Syntactic Fructose
  • Hey, thank you for the explanation. I did try this with the **wildcard query** and it works. But I have tens of searches inside `should` and the performance was extremely horrible and mostly did not work. It actually started to timeout with: `org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@980a379 on QueueResizingEsThreadPoolExecutor`. So with a lot of search queries, I don't think regex or wildcard is the right way to do this. – Ankit Jun 19 '20 at 07:00
  • Is there any other way, I can match the exact substring in the `browsing_url` field apart from `regex` or `wildcard`? – Ankit Jun 19 '20 at 07:04
  • @Ankit I rewrote my answer to use an analyzer instead of `regexp`. While the query should be much faster, it requires you to re-index your documents with the analyzer mentioned above. – Syntactic Fructose Jun 19 '20 at 19:16
  • If this new approach works for you please hit the green checkmark on the left of my answer to accept it :) – Syntactic Fructose Jun 19 '20 at 19:18
  • ok, thanks @Syntactic Fructose. I will try this out. – Ankit Jun 21 '20 at 12:17