
Index:

{
    "settings": {
        "index.percolator.map_unmapped_fields_as_text": true
    },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            }
        }
    }
}

This test percolator query works

{
    "query": {
        "match": {
            "message": "blah"
        }
    }
}

This query doesn't work

{
    "query": {
        "simple_query_string": {
            "query": "bl*"
        }
    }
}

Results:

{"took":15,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.13076457,"hits":[{"_index":"my-index","_type":"_doc","_id":"1","_score":0.13076457,"_source":{"query":{"match":{"message":"blah"}}},"fields":{"_percolator_document_slot":[0]}}]}}

Why doesn't this simple_query_string query match the document?

Rrr
  • What exactly is your question? do you have an example of what's not working? – Assael Azran Nov 03 '19 at 09:50
  • The question is how to set up a percolator index to work with simple_query_string queries, or how to insert a simple_query_string query into a percolator index. Basically I’m looking for a working example. – Rrr Nov 03 '19 at 17:28
  • Your last query `{ "query": { "simple_query_string": {"query": "blah"}, "analyzer" : "my_analyzer" } }` errors out because it is not valid, it should be `{ "query": { "simple_query_string": {"query": "blah", "analyzer" : "my_analyzer" } } }` – Val Nov 04 '19 at 09:56

1 Answer


I don't understand what you are asking either. It may be that you do not understand percolator very well? This is an example I just tried now.

Let's assume you have an index - let's call it test - in which you want to index some documents. This index has the following mapping (just a random test index I have in my test setup):

{  
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

You notice it has a custom email analyzer that splits something like foo@bar.com into these tokens: foo@bar.com, foo, bar.com, bar, com.
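
To see where those tokens come from, here is a rough Python approximation of the email token filter. It is only a sketch, not how Lucene implements pattern_capture: it assumes the uax_url_email tokenizer keeps the whole address as a single token, it substitutes `[^\W\d_]` for `\p{L}` (Python's `re` module does not support `\p{L}`), and the helper name `email_tokens` is made up for illustration. The token order may differ from what the `_analyze` API returns.

```python
import re

# The five capture patterns from the "email" filter above.
# Assumption: [^\W\d_] stands in for \p{L}, which Python's re module lacks.
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
    r"([^-@]+)",
]


def email_tokens(address):
    """Approximate the tokens the `email` analyzer produces for one address."""
    # preserve_original: true keeps the unmodified token as well.
    candidates = [address]
    for pattern in PATTERNS:
        for match in re.finditer(pattern, address):
            candidates.append(match.group(1))
    # "lowercase" filter, then "unique" filter (drop repeats, keep first occurrence).
    seen = []
    for token in candidates:
        token = token.lower()
        if token not in seen:
            seen.append(token)
    return seen


print(email_tokens("foo@bar.com"))
```

Running the real analyzer through the `_analyze` API on the index remains the authoritative check; the sketch is only meant to make the pattern list easier to read.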

As the documentation says, you can create a separate percolator index that holds only your percolator queries and not the documents themselves. Even though the percolator index doesn't contain the documents, it should still have the mapping of the index that will hold them (test in our case).

This is the mapping of the percolator index (which I called percolator_index); it also has the special analyzer used for splitting the email field:

{  
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            },
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

Its mapping and settings are almost the same as those of my original index; the only difference is the additional query field of type percolator in the mapping.

The query you are interested in - simple_query_string - should go into a document inside percolator_index. Like so:

PUT /percolator_index/_doc/1?refresh
{
    "query": {
        "simple_query_string" : {
            "query" : "month foo@bar.com",
            "fields": ["part", "email"]
        }
    }
}

To make it more interesting, I added the email field in there to be specifically searched for in the query (by default, all of them are searched).

Now, the aim is to test a document that should eventually go into the test index against this simple_query_string query from your percolator index. For example:

GET /percolator_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "date":"2004-07-31T11:57:52.000Z","part":"month","code":109,"val":0,"email":"foo@bar.com"
      }
    }
  }
}
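
If you send this request from code rather than from the console, the body is ordinary JSON. Below is a minimal Python sketch using only the standard library; the helper name `percolate_body` is hypothetical, and actually sending the body to /percolator_index/_search with an HTTP client of your choice is left out.

```python
import json


def percolate_body(document, field="query"):
    """Wrap a candidate document in a percolate query against `field`."""
    return {"query": {"percolate": {"field": field, "document": document}}}


# The same candidate document as in the request above.
doc = {
    "date": "2004-07-31T11:57:52.000Z",
    "part": "month",
    "code": 109,
    "val": 0,
    "email": "foo@bar.com",
}

body = percolate_body(doc)
print(json.dumps(body, indent=2))
```

Building the body as a dict and serializing it with `json.dumps` also avoids hand-written JSON mistakes such as trailing commas.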

What's under document is, obviously, your future document, which does not exist in any index yet. It will be matched against the simple_query_string defined above and will result in a match:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.39324823,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.39324823,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}

What if I had percolated this document instead:

GET /percolator_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "date":"2004-07-31T11:57:52.000Z","part":"month","code":109,"val":0,"email":"foo"
      }
    }
  }
}

(Notice that the email is only foo.) This is the result:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.26152915,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.26152915,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}

Notice that the score is a bit lower than for the first percolated document. That is probably because foo (the document's email) matches only one of the terms produced by analyzing foo@bar.com in the query, whereas a document whose email is foo@bar.com matches all of them and so gets a better score.
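
One way to picture that explanation is to compare the terms each side produces for the email field. The sketch below hard-codes the token list given earlier in this answer instead of calling Elasticsearch, and it only counts shared terms; it does not reproduce how the score is actually computed.

```python
# Terms the `email` analyzer yields for the query text "foo@bar.com"
# (the token list quoted earlier in this answer).
query_terms = {"foo@bar.com", "foo", "bar.com", "bar", "com"}

# Terms for the email value of each of the two percolated documents.
# Assumption: "foo" analyzes to the single token "foo".
doc_terms = {
    "foo@bar.com": {"foo@bar.com", "foo", "bar.com", "bar", "com"},
    "foo": {"foo"},
}

# Count how many query terms each document's email value shares.
overlap = {email: len(query_terms & terms) for email, terms in doc_terms.items()}
for email, shared in overlap.items():
    print(f"{email}: {shared} shared term(s)")
```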

I'm not sure which analyzer you are talking about, though. I think the example above covers the only "analyzer" issue that might be a bit confusing.

Andrei Stefan
  • Thank you for such a detailed answer Andrei, but my problem is a little bit different. The simple_query_string from my example doesn't look at a specific field; instead it looks across all the fields in the document. So I'm looking for how to make this cross-field query work under the percolator. – Rrr Nov 04 '19 at 07:57
  • That doesn't make a difference. You can remove the `fields` part from the query definition. The point is that it's about how you structure your index and what analyzers you define for your fields (I just provided a slightly more complex example with a custom `email` analyzer), not about the percolator. The percolator is just a way to run a document through a set of defined queries (your `simple_query_string`). Define the mappings for the fields used in the document as if the percolator weren't involved at all. Can you explain a bit more, please? I still don't see an issue here, sorry. – Andrei Stefan Nov 04 '19 at 11:25
  • May I suggest one more thing: play with `simple_query_string` just as you would if the percolator were not in the picture, for example by using an `analyzer` to analyze the text provided to `simple_query_string`. When you have settled on the final query and the final mapping for your index, move that to a percolator. – Andrei Stefan Nov 04 '19 at 11:31
  • I've updated the question with an actual simple_query_string example which doesn't work. – Rrr Nov 04 '19 at 13:25
  • Now I understand the actual issue you have. Let me look into it. – Andrei Stefan Nov 04 '19 at 17:38
  • FYI, I created https://github.com/elastic/elasticsearch/issues/48874 because there is either an undocumented issue or a bug in itself. – Andrei Stefan Nov 06 '19 at 19:51
  • Thank you Andrei, this was really helpful, I hope you guys will sort this out. – Rrr Nov 06 '19 at 22:57