
If I use the keyword_repeat filter in the index settings, then when I search for a document with a bool query using should, only the first match clause appears to be searched. Elasticsearch version: 8.7.1

Creating an index

curl -X PUT "elasticsearch:9200/my-test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {          
          "tokenizer": "default_tokenizer",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "default_stemmer"
          ]
        }
      },
      "tokenizer": {
        "default_tokenizer": {
          "type": "standard"
        }
      },
      "filter": {
        "default_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text"
      },
      "field2": {
        "type": "text"
      }
    }
  }
}
'

Adding a document

curl -X POST "elasticsearch:9200/my-test-index/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "field1": "running man",
  "field2": "other text"
}
'

Searching documents

curl -X GET "elasticsearch:9200/my-test-index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field2":  "running" }},
        { "match": { "field1": "running" }}
      ]
    }
  }
}
'

Response:

{
  "took" : 243,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

I expected the document to be found.

But the same request with the fields in a different order (field1, then field2)

curl -X GET "elasticsearch:9200/my-test-index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field1":  "running" }},
        { "match": { "field2": "running" }}
      ]
    }
  }
}
'

finds the document:

{
  "took" : 62,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.92058265,
    "hits" : [
      {
        "_index" : "my-test-index",
        "_id" : "1",
        "_score" : 0.92058265,
        "_source" : {
          "field1" : "running man",
          "field2" : "other text"
        }
      }
    ]
  }
}

I expect the should condition to work like an OR condition, so both queries should have returned a result, regardless of the order of the fields in the query. If I remove keyword_repeat from the index settings, everything works as expected and both queries find documents.

List of tokens for index with keyword_repeat filter

curl -X GET "elasticsearch:9200/my-test-index/_termvectors/1?pretty&fields=field1,field2"

{
  "_index" : "my-test-index",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "took" : 116,
  "term_vectors" : {
    "field2" : {
      "field_statistics" : {
        "sum_doc_freq" : 2,
        "doc_count" : 1,
        "sum_ttf" : 4
      },
      "terms" : {
        "other" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 5
            },
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 5
            }
          ]
        },
        "text" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 6,
              "end_offset" : 10
            },
            {
              "position" : 1,
              "start_offset" : 6,
              "end_offset" : 10
            }
          ]
        }
      }
    },
    "field1" : {
      "field_statistics" : {
        "sum_doc_freq" : 3,
        "doc_count" : 1,
        "sum_ttf" : 4
      },
      "terms" : {
        "man" : {
          "term_freq" : 2,
          "tokens" : [
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 11
            },
            {
              "position" : 1,
              "start_offset" : 8,
              "end_offset" : 11
            }
          ]
        },
        "run" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7
            }
          ]
        },
        "running" : {
          "term_freq" : 1,
          "tokens" : [
            {
              "position" : 0,
              "start_offset" : 0,
              "end_offset" : 7
            }
          ]
        }
      }
    }
  }
}
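
The term vectors above can be reproduced with a small simulation. This is only a sketch of the filter chain from the index settings (lowercase, keyword_repeat, stemmer), not Elasticsearch code: `TOY_STEMS` is a hypothetical stand-in for the English stemmer that covers just the sample words, and `analyze` is an illustrative name. The `dedupe` flag models what a same-position `unique` filter such as `unique_stem` would do; note that `unique_stem` is defined in the settings but is not listed in the analyzer's filter chain.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the English stemmer: only knows the sample words.
TOY_STEMS: Dict[str, str] = {"running": "run"}

Token = Tuple[str, int]  # (term, position)


def analyze(text: str, dedupe: bool = False) -> List[Token]:
    """Simulate lowercase -> keyword_repeat -> stemmer on whitespace-split text."""
    tokens: List[Token] = []
    for position, word in enumerate(text.lower().split()):
        # keyword_repeat emits each token twice at the same position:
        # a keyword copy the stemmer leaves alone, then a copy it may stem.
        pair = [word, TOY_STEMS.get(word, word)]
        if dedupe:
            # Model a unique filter with only_on_same_position: drop
            # identical terms that share a position, keeping order.
            pair = list(dict.fromkeys(pair))
        tokens.extend((term, position) for term in pair)
    return tokens


print(analyze("running man"))
print(analyze("other text"))
print(analyze("other text", dedupe=True))
```

When the stem equals the original word ("man", "other", "text"), the same term is emitted twice at one position, which is why those terms show term_freq 2, while "running" and "run" each show term_freq 1.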

I tested different versions of Elasticsearch and got the following results:

8.8.1 - works as expected
8.8.0 - works as expected
8.7.1 - problem exists
8.7.0 - problem exists
8.6.2 - works as expected

Ilya

1 Answer

The order of the clauses in your query does not matter, so the following queries should return the same results. You may have seen the empty result the first time because of refresh_interval: a newly indexed document is not searchable until the next refresh.

PUT test_should_index?pretty
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {          
          "tokenizer": "default_tokenizer",
          "filter": [
            "lowercase",
            "keyword_repeat",
            "default_stemmer"
          ]
        }
      },
      "tokenizer": {
        "default_tokenizer": {
          "type": "standard"
        }
      },
      "filter": {
        "default_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text"
      },
      "field2": {
        "type": "text"
      }
    }
  }
}

### Adding a document
POST test_should_index/_doc/1?pretty&refresh
{
  "field1": "running man",
  "field2": "other text"
}

### Searching documents
GET test_should_index/_search?pretty
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field2":  "running" }},
        { "match": { "field1": "running" }}
      ]
    }
  }
}

GET test_should_index/_search?pretty
{
  "query": {
    "bool": {
      "should": [
        { "match": { "field1":  "running" }},
        { "match": { "field2": "running" }}
      ]
    }
  }
}

Result:

# GET test_should_index/_search?pretty 200 OK
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.46029133,
    "hits": [
      {
        "_index": "test_should_index",
        "_id": "1",
        "_score": 0.46029133,
        "_source": {
          "field1": "running man",
          "field2": "other text"
        }
      }
    ]
  }
}
# GET test_should_index/_search?pretty 200 OK
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.46029133,
    "hits": [
      {
        "_index": "test_should_index",
        "_id": "1",
        "_score": 0.46029133,
        "_source": {
          "field1": "running man",
          "field2": "other text"
        }
      }
    ]
  }
}
Musab Dogan
  • Thanks for the help. I tried adding the "refresh" parameter as in your example, but nothing changed; the problem persisted. I then tested different versions of Elasticsearch and got the following results: 8.8.1 and 8.8.0 work as expected; 8.7.1 and 8.7.0 have the problem; 8.6.2 works as expected. So the problem seems to be specific to version 8.7.* – Ilya Jun 08 '23 at 19:17
  • I also noticed an interesting detail: on version 8.7.*, the search with the field order field1, field2 returns max_score = 0.92058265, whereas with the normal behavior in other versions max_score = 0.46029133 (for both queries). That is exactly half. But I don't know what it could mean – Ilya Jun 08 '23 at 19:17
  • Wow, great test! Thank you so much for sharing it with me. It really looks like there is a problem with ES v8.7. The `profile` API can help you understand how the scoring works: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html – Musab Dogan Jun 09 '23 at 08:01
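
The two max_score values mentioned in the comments can be checked by hand with the BM25 formula. This is a rough sketch under several assumptions, not a confirmed account of what Elasticsearch does internally: default BM25 parameters (k1 = 1.2, b = 0.75) with the (k1 + 1) boost, "running" and "run" scored as one combined term (so freq = 2), same-position tokens not counted in the field length (so doc_len = 2), and the average field length taken as sum_ttf / doc_count = 4 / 1 from the term vectors in the question. `bm25_term_score` is an illustrative name.

```python
import math


def bm25_term_score(freq: float, doc_len: float, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of one (combined) term in one document."""
    # Inverse document frequency with the usual +0.5 smoothing.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation relative to the average field length.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return (k1 + 1) * idf * freq / (freq + length_norm)


# Assumed inputs for field1 of the single indexed document (see lead-in).
score = bm25_term_score(freq=2, doc_len=2, avg_doc_len=4,
                        doc_count=1, doc_freq=1)
print(f"{score:.6f}")      # 0.460291
print(f"{2 * score:.6f}")  # 0.920583
```

Under these assumptions the formula reproduces 0.46029133, and the 8.7.* value is twice that. That would be consistent with the field1 clause being counted twice on 8.7.*, but the sketch does not establish the cause; the profile or explain output on an 8.7 node would be needed to confirm it.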