I am trying to use Elasticsearch to perform a phrase search on a string field, and I don't understand the order in which the results are returned. I have a simple "match_phrase" query of the form:

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "FieldToSearch": "find this phrase"
    }
  }
}

Let's say I had documents containing the following values for "FieldToSearch": ["This is the way to find this phrase", "find this phrase", "find this phrase to win a prize"]. I would expect "find this phrase" to be returned before the other two results because it exactly matches the phrase I am looking for. However, I've noticed that something like "find this phrase to win a prize" is sometimes listed first. Is there a way to return exact matches before results that merely contain the phrase?

Pierce Mason

1 Answer

The phrase "find this phrase" is too common among the documents in your index. Essentially every document matches this query, so the small differences in relevance are due to the field-length norm. As far as I know, the field-length norm is computed per shard. So when each of the three documents in your index lives in its own shard, you can see slightly surprising results in which the document with the shortest field scores lower than the others. You can test this by creating the index with only one primary shard; in that case the document with the field value "find this phrase" will get the highest score. You can also achieve the same result with several primary shards by disabling the field-length norm:

PUT your_index/_mapping/your_type
{
  "properties": {
    "FieldToSearch": {
      "type": "text",
      "norms": false
    }
  }
}

But I think more accurate queries would be a better choice.
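
If you want to try the single-shard test mentioned above, a rough sketch could look like the following. The index name test_index is only a placeholder of mine. The second request is an alternative I am adding as an assumption about what fits your setup: the dfs_query_then_fetch search type asks Elasticsearch to gather term statistics from all shards before scoring, so it needs no index changes but costs an extra round trip per search.

```
# Create a throwaway index with a single primary shard (test_index is a placeholder name)
PUT test_index
{
  "settings": {
    "number_of_shards": 1
  }
}

# Alternative: keep the existing index and use global term statistics for this search
GET your_index/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match_phrase": {
      "FieldToSearch": "find this phrase"
    }
  }
}
```

Index the same three documents into the single-shard index and run the original match_phrase query against it to compare the ordering.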

EDIT:

My point is simply to use more specific queries that contain relatively unique tokens. For example, instead of querying the phrase Jurassic Park, which is contained in almost every document in your index, it would be better to query World Jurassic Park, which is contained in only one document.

However, there is a way to achieve the desired results for your example. Look at this question. You will need to change the mapping to enable a token counter on the relevant fields:

PUT your_index/_mapping/your_type
{
  "properties": {
    "FieldToSearch": { 
      "type": "text",
      "fields": {
        "length": { 
          "type": "token_count",
          "analyzer": "standard"
        }
      }
    }
  }
}
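
One caveat, as far as I know: a newly added sub-field is only populated for documents indexed after the mapping change. Assuming your index already holds data, an update-by-query without a script should re-index the existing documents in place so that FieldToSearch.length gets filled in. Treat this as a sketch and try it on a test index first:

```
# Re-index existing documents in place so the new token_count sub-field is populated
POST your_index/_update_by_query?conflicts=proceed
```

The conflicts=proceed parameter tells the request to skip documents that change while it is running instead of aborting.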

Then use function_score to boost relevance based on the number of tokens the field contains:

GET your_index/your_type/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "FieldToSearch": "Jurassic Park"
        }
      },
      "field_value_factor": {
        "field": "FieldToSearch.length",
        "modifier": "reciprocal"
      }
    }
  }
}

The reciprocal modifier multiplies each document's score by 1 / token count, so documents whose field contains fewer tokens will get a higher score.

briarheart
  • Thanks, that explains why I get these weird results. I would prefer to avoid modifying the index, though, because I am new to Elasticsearch and do not know all of the ramifications of that. Can you give an example of how I could write a more accurate query in this case? For example, let's say I was searching a movie collection for all movies with "Jurassic Park" in the title. How would I write a query that would list "Jurassic Park" before "The Lost World: Jurassic Park" and "Jurassic Park III"? – Pierce Mason Feb 02 '18 at 14:53
  • @PierceMason I've just edited my answer. See the latest version please. – briarheart Feb 02 '18 at 18:08
  • Thanks, I was hoping there would be a way to simplify the query, but it sounds like adding the token count may work. In this case the phrase comes from the user, so I can't change the contents of the phrase to make it more specific. – Pierce Mason Feb 08 '18 at 16:11
  • @PierceMason OK. As a small supplement, there is an [article](https://www.elastic.co/guide/en/elasticsearch/guide/2.x/relevance-is-broken.html) in the Elasticsearch guide that, it seems to me, describes exactly your case. – briarheart Feb 08 '18 at 16:49
  • Thanks, that article helped explain things. We are in the process of setting up Elasticsearch and don't have all of the data yet, so it seems possible that this problem will end up being mitigated (according to that article). – Pierce Mason Feb 09 '18 at 18:22