I have an Elasticsearch index for the Wikipedia corpus with documents of four types (paragraph, infobox, list, and table). Each document has the following fields: page_title, section_path, and document_text.
I'm querying this corpus with a question, "what country is Narora located in?", scoped to a single page ("List of nuclear power stations") and document type ("table"), as shown below. This should return the table documents found on this Wikipedia page.
{
  "size": 100,
  "_source": [
    "id",
    "document_text",
    "section_path",
    "page_title"
  ],
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "document_text": "what country is Narora located in?"
          }
        },
        {
          "match": {
            "section_path": "what country is Narora located in?"
          }
        },
        {
          "match": {
            "page_title": "what country is Narora located in?"
          }
        }
      ],
      "must": [
        {
          "match": {
            "page_title": "List of nuclear power stations"
          }
        },
        {
          "match": {
            "paragraph_type": "table"
          }
        }
      ]
    }
  }
}
This query returns empty results. When I remove either the section_path or the page_title should clause, I get multiple table documents as a result, including the "In service" one here, which has a mention of Narora in its document_text field.
Granted, the page_title ("List of nuclear power stations") and section_path ("In service", "Under construction" ...) don't overlap with the question. Still, I'm surprised by this behavior, as a should clause is only supposed to affect the scoring, not what matches (source). So adding a should clause shouldn't cause Elasticsearch to return an empty result.
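To spell out the rule I'm relying on: as far as I understand the documentation for recent Elasticsearch versions, minimum_should_match on a bool query defaults to 1 only when there are should clauses and no must or filter clauses, and to 0 otherwise. The small Python sketch below just encodes that reading; the function name is my own and is not part of any Elasticsearch client.

```python
def default_minimum_should_match(n_should: int, n_must: int = 0, n_filter: int = 0) -> int:
    """Default minimum_should_match for a bool query, per my reading of the docs.

    Assumption: this reflects recent Elasticsearch versions; older releases
    may treat should clauses differently in some contexts.
    """
    # should clauses become mandatory only when nothing else constrains matching
    if n_should > 0 and n_must == 0 and n_filter == 0:
        return 1
    return 0


# My query: three should clauses alongside two must clauses
print(default_minimum_should_match(n_should=3, n_must=2))
```

If that reading is right, none of the three should clauses in my query has to match for a document to be returned.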
Any thoughts on what could be going on here? Is there any way to force Elasticsearch to ignore the should clause(s) if there's no match and still return ALL the documents that match the must criteria?
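One thing I'm considering is setting minimum_should_match to 0 explicitly, so the optional nature of the should clauses no longer depends on any default. Below is a minimal Python sketch that builds the same request body with that parameter added. The helper name build_query is hypothetical, the field names are the ones from my query above, and I have not confirmed that this changes the result.

```python
import json


def build_query(question: str, page_title: str, doc_type: str, size: int = 100) -> dict:
    """Build the search body from the question, with should clauses made explicitly optional."""
    return {
        "size": size,
        "_source": ["id", "document_text", "section_path", "page_title"],
        "query": {
            "bool": {
                # Scoring-only clauses: one match per text field
                "should": [
                    {"match": {field: question}}
                    for field in ("document_text", "section_path", "page_title")
                ],
                # Explicitly allow zero should clauses to match
                "minimum_should_match": 0,
                # Scoping clauses, unchanged from the original query
                "must": [
                    {"match": {"page_title": page_title}},
                    {"match": {"paragraph_type": doc_type}},
                ],
            }
        },
    }


body = build_query(
    "what country is Narora located in?",
    "List of nuclear power stations",
    "table",
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. Moving the two scoping clauses from must to filter would be another variant to try, since they are only there to restrict the result set and filter clauses do not contribute to the score.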