5

I have an index with multiple fields in it. I want to filter out based on presence of search string in all the fields except one - user_comments. The query search that I am doing is

{
    "from": offset,
    "size": limit,
    "_source": [
      "document_title"
    ],
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must":
            {
              "query_string": {
                "query": "#{query}"
              }
            }
          }
        }
      }
    }
  }

Although the query string is searching through all the fields, and giving me documents with matching string in the user_comments field as well. But, I want to query it against all the fields leaving out the user_comments field. The white-list is a very big list and also the name of the fields are dynamic, so it is not feasible to mention the white-listed field list using the fields parameter like.

"query_string": {
                    "query": "#{query}",
                    "fields": [
                      "document_title",
                      "field2"
                    ]
                  }

Can anybody please suggest an idea on how to exclude a field from being searched?

Richa Sinha
  • 1,406
  • 15
  • 29
  • I think with the ElasticSearch 1.4 it was possible using partial_fields. Something similar to what is mentioned in this answer - https://stackoverflow.com/a/31713773/4066118 – Richa Sinha Oct 11 '18 at 10:12

2 Answers2

7

There is a way to make it work, it's not pretty but will do the job. You may achieve your goal using a boost and multifield parameters of query_string, bool query to combine the scores and setting min_score:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "#{query}",
            "type": "most_fields",
            "boost": 1
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "#{query}",
            "boost": -1
          }
        }
      ]
    }
  },
  "min_score": 0.00001
}

So what happens under the hood?

Let's assume you have the following set of documents:

PUT my-query-string/doc/1
{
  "title": "Prodigy in Bristol",
  "text": "Prodigy in Bristol",
  "comments": "Prodigy in Bristol"
}
PUT my-query-string/doc/2
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham",
  "comments": "And also in Bristol"
}
PUT my-query-string/doc/3
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham and Bristol",
  "comments": "And also in Cardiff"
}
PUT my-query-string/doc/4
{
  "title": "Prodigy in Birmigham",
  "text": "Prodigy in Birmigham",
  "comments": "And also in Cardiff"
}

In your search request you would like to see only documents 1 and 3, but your original query will return 1, 2 and 3.

In Elasticsearch, search results are sorted by relevance _score, the bigger the score the better.

So let's try to boost down the "comments" field so its impact into relevance score is neglected. We can do this by combining two queries with a should and using a negative boost:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "Bristol"
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  }
}

This will give us the following output:

{
  "hits": {
    "total": 3,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "2",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham",
          "comments": "And also in Bristol"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      }
    ]
  }
}

Document 2 has got penalized, but also document 1 did, although it is a desired match for us. Why did it happen?

Here's how Elasticsearch computed _score in this case:

_score = max(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

Document 1 matches the comments:"Bristol" part and it also happens to be the best score. According to our formula the resulting score is 0.

What we would actually like to do is to boost first clause (with "all" fields) more if more fields matched.

Can we boost query_string matching more fields?

We can, query_string in multifield mode has a type parameter that does exactly that. The query will look like this:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "type": "most_fields",
            "query": "Bristol"
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  }
}

This will give us the following output:

{
  "hits": {
    "total": 3,
    "max_score": 0.57536423,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0.57536423,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "2",
        "_score": 0,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham",
          "comments": "And also in Bristol"
        }
      }
    ]
  }
}

As you can see, the undesired document 2 is on the bottom and has score of 0. Here's how the score was computed this time:

_score = sum(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

So the documents matching "Bristol" in any field got selected. Relevance score for comments:"Bristol" got eliminated, and only documents matching title:"Bristol" or text:"Bristol" got a _score > 0.

Can we filter out those results with undesired score?

Yes, we can, using min_score:

POST my-query-string/doc/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "query_string": {
            "query": "Bristol",
            "type": "most_fields",
            "boost": 1
          }
        },
        {
          "query_string": {
            "fields": [
              "comments"
            ],
            "query": "Bristol",
            "boost": -1
          }
        }
      ]
    }
  },
  "min_score": 0.00001
}

This will work (in our case) since the score of the documents will be 0 if and only if "Bristol" was matched against field "comments" only and didn't match any other field.

The output will be:

{
  "hits": {
    "total": 2,
    "max_score": 0.57536423,
    "hits": [
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "1",
        "_score": 0.57536423,
        "_source": {
          "title": "Prodigy in Bristol",
          "text": "Prodigy in Bristol",
          "comments": "Prodigy in Bristol",
          "discount_percent": 10
        }
      },
      {
        "_index": "my-query-string",
        "_type": "doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Prodigy in Birmigham",
          "text": "Prodigy in Birmigham and Bristol",
          "comments": "And also in Cardiff"
        }
      }
    ]
  }
}

Can it be done in a different way?

Sure. I wouldn't actually advise to go with _score tweaking since it is a pretty complex matter.

I would advise to make a fetch of existing mapping and construct a list of fields to run the query against beforehand, this will make the code much simpler and straightforward.

Original solution proposed in the answer (kept for history)

Originally it was proposed to use this kind of query with exactly the same intent as the solution above:

POST my-query-string/doc/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": {
            "query_string": {
              "fields" : ["*", "comments^0"],
              "query": "#{query}"
            }
          }
        }
      }
    }
  },
  "min_score": 0.00001
}

The only problem is that if an index contains any numeric values, this part:

"fields": ["*"]

raises an error since textual query string cannot be applied to a number.

Nikolay Vasiliev
  • 5,656
  • 22
  • 31
  • The **^0** is working although when I add the wildcard character the query creation fails "fields" : ["*", "comments^0"]. Is there any other way to specify all fields. Also I believe the min_score should be set to something greater than 0, otherwise it will pick comments as well as it's score would be 0. I am keeping 0.000001 as the minimum score. And i am using function score because in my actual query I have boost applied. – Richa Sinha Oct 11 '18 at 15:07
  • @RichaSinha Which version of ES are you running on? I checked the code above with 6.3 and it was working. – Nikolay Vasiliev Oct 11 '18 at 15:47
  • I am on ES version 6.1 – Richa Sinha Oct 11 '18 at 15:51
  • @RichaSinha `> Also I believe the min_score should be set to something greater than 0` - thanks, there was a typo in my answer, of course it should be smth small but greater than 0. `I am on ES version 6.1` - just tried with ES 6.0, `"fields" : ["*", "comments^0"],` works, may you show me the exact error message? – Nikolay Vasiliev Oct 11 '18 at 16:16
  • the error says that - "Can only use prefix queries on keyword and text fields - not on [object.discount_percent] which is of type [double]". Can you try to put a field of decimal type with type as "keyword" while putting to index. – Richa Sinha Oct 12 '18 at 05:07
  • @RichaSinha I managed to reproduce the problem, thanks! It looks like an ES bug. In the meantime I also managed to find a way to make it work, please check the answer, I updated it. – Nikolay Vasiliev Oct 12 '18 at 18:07
  • 1
    Thank you that solved my issue. Is this bug there in the latest version as well, if yes I think this should be raised to them. – Richa Sinha Oct 14 '18 at 15:05
  • one more query, why are we using most_fields? it combines all the scores together if I am not wrong, how is that helping us? Also it works fine by using the default type i.e. best_fields – Richa Sinha Oct 15 '18 at 05:21
  • @Richa Sinha we use `most_fields` to ask ES to give us a sum of all scores of all matched fields, so more matched fields will mean higher score (see [docs](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/query-dsl-multi-match-query.html#multi-match-types)). Theres a subtle case where `best_fields` won't work: there's a match in the `text` with score say 5 and in the `comments` with score 6... The best will be 6 and will be eliminated (since we subtract the score of `comments`). – Nikolay Vasiliev Oct 15 '18 at 12:04
  • thnx @Nikolay that explains it – Richa Sinha Oct 16 '18 at 05:51
0

The way you are searching, ES will look matches in the _all field. To exclude one field, you can disable _all field for user comments.

Reference - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html#enabling-all-field

For ES 6.x it can be replicated using copy_to

https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html

xrage
  • 4,690
  • 4
  • 25
  • 31
  • That is correct, the way I have added it will search in all the fields. And if I am not wrong the **copy_to** will create a duplicate of the same data. I have about 500 of such fields with about millions of values for each field. The duplication will make the search un-necessarily slow – Richa Sinha Oct 11 '18 at 10:43
  • Another way could be using `enabled: false` for user_comments, I have not tested it but based on docs, Seems it just stores the value in the index but don't run any query on this. https://www.elastic.co/guide/en/elasticsearch/reference/6.4/enabled.html – xrage Oct 11 '18 at 11:06
  • That could work but I have a seperate query to just search from **user_comments** field. So by making it enabled:false, I won't be able to make any search on this field. therefore this solution won't work for me. – Richa Sinha Oct 11 '18 at 11:18