I'd like to find the string outlook.com in a text field of my Elasticsearch index using a match_phrase query. But I don't want results like something...@outlook.com, which are returned by this query:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0
            }
          }
        }
      ]
    }
  }
}
I think these results are returned because the tokenizer of the standard analyzer splits something...@outlook.com into [something...] and [outlook.com], using @ as a separator.
I tried the whitespace analyzer, so that the address is tokenized as [something...@outlook.com] and full email addresses are no longer matched. But this query:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0,
              "analyzer": "whitespace"
            }
          }
        }
      ]
    }
  }
}
still finds results like something...@outlook.com. What can I do?
UPDATE:
In my mapping, I set the standard analyzer some time ago. So my intuition is that even if I use the whitespace analyzer at search time, the documents were already tokenized with the standard one, and the tokenization can no longer be changed after indexing.
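To make this intuition concrete, here is a small Python sketch. It does not call Elasticsearch: `standard_like` and `whitespace_like` are hypothetical helpers that only roughly approximate the two analyzers (the real standard tokenizer follows Unicode segmentation rules, which are not modelled here), and the sample message is made up. What it tries to show is that a phrase match compares the query tokens against the tokens stored at index time, so changing only the search-time analyzer leaves the indexed outlook.com token in place.

```python
import re

def standard_like(text):
    # Rough stand-in for the standard analyzer: lowercase, then split on
    # whitespace and "@" (assumption: the real Lucene rules are more involved).
    return [t for t in re.split(r"[\s@]+", text.lower()) if t]

def whitespace_like(text):
    # Rough stand-in for the whitespace analyzer: split on whitespace only.
    return text.split()

def phrase_match(indexed_tokens, query_tokens):
    # True if query_tokens occur as a contiguous run (like slop 0).
    n = len(query_tokens)
    return any(indexed_tokens[i:i + n] == query_tokens
               for i in range(len(indexed_tokens) - n + 1))

message = "mail from john@outlook.com received"  # made-up sample document

indexed = standard_like(message)        # tokens are fixed at index time
query = whitespace_like("outlook.com")  # only the search side changes

print(indexed)
print(phrase_match(indexed, query))                   # email still matches
print(phrase_match(whitespace_like(message), query))  # no match if indexed with whitespace
```

In this simplified model, the email address stops matching only when the document itself is tokenized on whitespace, which would mean changing the mapping and reindexing rather than changing the query.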
I tried using a painless script to match a certain pattern, but my field is of type text, so the search takes too long.
Alternatively, a regexp query can do something similar:
GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": ".*[^A-Za-z0-9\\@]outlook.com[^A-Za-z0-9\\@].*"
          }
        }
      ]
    }
  }
}
Unfortunately, according to the regexp syntax documentation, only a limited set of operators is supported. With the character class [^A-Za-z0-9\\@] I mean any character next to outlook.com that is neither a @ nor an alphanumeric character (this simulates the word boundary that we would get with the match_phrase query). My problem is that if the field starts or ends with outlook.com, it's not retrieved, because the regex requires a character before and after it ([^A-Za-z0-9\\@] doesn't match the empty string).