I'd like to find the string outlook.com in a text field of my Elasticsearch index with a match_phrase query. But I don't want results where it is part of an email address such as something...@outlook.com, which this query does return:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0
            }
          }
        }
      ]
    }
  }
}

I think these results are returned because the tokenizer of the standard analyzer splits something...@outlook.com into [something...], [outlook.com], treating @ as a separator.
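
One way to check this hypothesis is the _analyze API, which returns the tokens an analyzer produces for a given text. A minimal sketch in the same Console syntax as the queries here; the sample sentence and the address someone@outlook.com are made up for illustration:

```json
POST /_analyze
{
  "analyzer": "standard",
  "text": "write to someone@outlook.com today"
}
```

The standard tokenizer follows Unicode word-boundary rules, so the response should list someone and outlook.com as separate tokens. That would explain why a phrase query for outlook.com also matches email addresses.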

I tried setting the whitespace analyzer so that the address is tokenized as [something...@outlook.com] and full email addresses are not returned. But this query:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "match_phrase": {
            "message": {
              "query": "outlook.com",
              "slop": 0,
              "analyzer": "whitespace"
            }
          }
        }
      ]
    }
  }
}

still returns results like something...@outlook.com. How can I avoid that?

UPDATE:

In my mapping, I set the standard analyzer some time ago. So my intuition is that even if I use the whitespace analyzer at search time, the documents were already tokenized with the standard one, and that tokenization can no longer be changed after indexing.
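
If reindexing is an option, one possible approach is a new index whose analyzer keeps email addresses as single tokens, for example with the built-in uax_url_email tokenizer. The following is only a sketch, assuming Elasticsearch 7 or later: the index name my_index_v2 and the analyzer name email_aware are hypothetical, and the mapping covers only the message field.

```json
PUT /my_index_v2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_aware": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "email_aware"
      }
    }
  }
}

POST /_reindex
{
  "source": { "index": "my_index" },
  "dest": { "index": "my_index_v2" }
}
```

With such a mapping, something...@outlook.com should be indexed as one token, so a match_phrase query for outlook.com against my_index_v2 would no longer match it. It is worth confirming the tokens with the _analyze API on a few real messages before reindexing everything, since URLs containing outlook.com may also be kept as single tokens.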

I tried a Painless script to match a certain pattern, but my field is of type text, so the search takes too long.

Otherwise, a regexp query can do something similar:

GET /my_index/_search
{
  "size": 1,
  "query": {
    "bool": {
      "should": [],
      "must": [
        {
          "regexp": {
            "message": ".*[^A-Za-z0-9\\@]outlook.com[^A-Za-z0-9\\@].*"
          }
        }
      ]
    }
  }
}

Unfortunately, according to the regexp syntax documentation, only a limited set of operators is available. With [^A-Za-z0-9\\@] I mean any character except @ and alphanumerics, placed before and after outlook.com (this simulates the word boundary that the match_phrase query would give). My problem is that if the field starts or ends with outlook.com, the document is not retrieved, because the regex requires a character before and after it ([^A-Za-z0-9\\@] doesn't match the empty string).
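
A possible workaround for the start and end of the field is to make the surrounding parts optional with grouping and the ? operator, both of which the Lucene regexp syntax supports. The sketch below rests on some assumptions: a regexp query is matched against whole indexed terms, so it targets a hypothetical unanalyzed subfield message.keyword rather than the analyzed message field, and the case_insensitive parameter is only available in newer Elasticsearch versions (7.10 and later). The dot is also escaped so that it matches only a literal dot.

```json
GET /my_index/_search
{
  "size": 1,
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "(.*[^A-Za-z0-9\\@])?outlook\\.com([^A-Za-z0-9\\@].*)?",
        "case_insensitive": true
      }
    }
  }
}
```

Each optional group matches either nothing or a run of text ending (or starting) with a character that is neither alphanumeric nor @, so a value that begins or ends with outlook.com can match as well. Patterns with a leading .* tend to be slow on large indices, so this is better suited to occasional searches.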

Paolo Magnani

1 Answer

You can use the regexp query instead of match_phrase, like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": ".*[^@]outlook.com"
          }
        }
      ]
    }
  }
}
Mouad Slimane
  • Thank you! I tried with the `regexp` too. Anyway I wanted the same "effect" of match_phrase: so I think before and after `outlook.com` should be something like word boundaries except `@`. Some of the typical regex patterns are not available in the ES syntax (https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html) so it was hard for me to create such a similar pattern. Moreover don't I have to escape `@`? Like `...[^\\@]..`. Anyway I don't understand why the `match_phrase` doesn't search the exact token, but it finds also results that "contains" that string. – Paolo Magnani Apr 19 '23 at 16:07
  • Understood: the analyzer expressed in search time is only applied to the text you put in the query but not in the text that was already indexed in the documents (they were analyzed with the standard analyzer) – Paolo Magnani Apr 20 '23 at 08:57