0

Consider this very basic T-SQL query:

select * from Users
where FirstName like '%dm0e776467@mail.com%' 
or LastName like '%dm0e776467@mail.com%' 
or Email like '%dm0e776467@mail.com%'

How can I write this in Lucene?

I have tried the following:

  1. The query way (does not work at all, no results):

    { "query": { "bool": { "should": [ { "wildcard": { "firstName": "dm0e776467@mail.com" } }, { "wildcard": { "lastName": "dm0e776467@mail.com" } }, { "wildcard": { "email": "dm0e776467@mail.com" } }
    ] } } }

  2. The Multimatch way (returns anything where mail.com is present)

    { "query": { "multi_match": { "query": "dm0e776467@mail.com", "fields": [ "firstName", "lastName", "email" ] } } }

  3. A third attempt (returns expected result, but if I only insert "mail", then no results are returned)

    { "query": { "query_string": { "query": ""dm0e776467@mail.com"", "fields": [ "firstName", "lastName", "email" ], "default_operator": "or", "allow_leading_wildcard": true } } }

It seems to me as there is no way to force Elasticsearch to force a query to use the input string as ONE substring?

Marcus
  • 8,230
  • 11
  • 61
  • 88

1 Answers1

1

The standard (default) analyzer will tokenize this email as follows:

GET _analyze
{
  "text": "dm0e776467@mail.com",
  "analyzer": "standard"
}

yielding

{
  "tokens" : [
    {
      "token" : "dm0e776467",
      ...
    },
    {
      "token" : "mail.com",
      ...
    }
  ]
}

This explains why the multi-match works with any *mail.com suffix and why the wildcards are failing.


I suggest the following modifications to your mapping, inspired by this answer:

PUT users
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "email"
      },
      "firstName": {
        "type": "text",
        "fields": {
          "as_email": {
            "type": "text",
            "analyzer": "email"
          }
        }
      },
      "lastName": {
        "type": "text",
        "fields": {
          "as_email": {
            "type": "text",
            "analyzer": "email"
          }
        }
      }
    }
  }
}

Note that I've used .as_email fields on your first- & lastName fields -- you may not want to force them to be mapped as emails by default.

Then after indexing a few samples:

POST _bulk
{"index":{"_index":"users","_type":"_doc"}}
{"firstName":"abc","lastName":"adm0e776467@mail.coms","email":"dm0e776467@mail.com"}
{"index":{"_index":"users","_type":"_doc"}}
{"firstName":"xyz","lastName":"opr","email":"dm0e776467@mail.com"}
{"index":{"_index":"users","_type":"_doc"}}
{"firstName":"zyx","lastName":"dm0e776467@mail.com","email":"qwe"}
{"index":{"_index":"users","_type":"_doc"}}
{"firstName":"abc","lastName":"efg","email":"ijk"}

the wildcards are working perfectly fine:

GET users/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "wildcard": {
            "email": "dm0e776467@mail.com"
          }
        },
        {
          "wildcard": {
            "lastName.as_email": "dm0e776467@mail.com"
          }
        },
        {
          "wildcard": {
            "firstName.as_email": "dm0e776467@mail.com"
          }
        }
      ]
    }
  }
}

Do check how this tokenizer works under the hood to prevent 'surprising' query results:

GET users/_analyze
{
  "text": "dm0e776467@mail.com",
  "field": "email"
}
Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68