
I could not find a good solution for the following situation, either through Google or in the ES docs. I hope someone can help here.

Suppose there are five documents with email addresses stored in the "email" field:

1. {"email": "john.doe@gmail.com"}
2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}
3. {"email": "hello-john.doe@outlook.com"}
4. {"email": "john.doe@outlook.com"}
5. {"email": "john@yahoo.com"}

I want to support the following search scenarios:

[Search -> Receive]

"john.doe@gmail.com" -> 1,2

"john.doe@outlook.com" -> 2,4

"john@yahoo.com" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matches are a MUST; for the rest, the more precise the better. I have already tried different combinations of index/search analyzers, tokenizers, and filters, and I have also tried adjusting the conditions of match queries, but I did not find an ideal solution. Any thoughts are welcome, and there is no restriction on the mappings, analyzers, or kind of query to use. Thanks.

LYu

1 Answer


Mapping:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}

Test data:

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}

Query to be used:

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "john.doe@gmail.com"
    }
  }
}
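
To see why a plain term query is enough here, it can help to look at which tokens the analyzer indexes for each document. The following is only a rough Python sketch of the analysis chain, not Elasticsearch itself. It rests on a few assumptions: splitting on commas and whitespace stands in for the uax_url_email tokenizer (which is sufficient for this test data only), `[^\W\d_]` stands in for `\p{L}` because Python's `re` module does not support that class, and the names `analyze` and `term_search` are made up for illustration.

```python
import re

# Capture patterns copied from the mapping. Python's `re` has no \p{L},
# so [^\W\d_] (a word character that is neither digit nor underscore)
# is used as an approximation of "any letter".
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
    r"([^-@]+)",
]

DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def tokenize(text):
    # Assumption: a crude stand-in for the uax_url_email tokenizer,
    # which keeps each email address as a single token.
    return [t for t in re.split(r"[,\s]+", text) if t]


def analyze(text):
    """Approximate the 'email' analyzer: pattern_capture, lowercase, unique."""
    tokens = set()  # a set gives the effect of the 'unique' filter
    for token in tokenize(text):
        tokens.add(token.lower())  # preserve_original keeps the full address
        for pattern in PATTERNS:
            # Each pattern is applied repeatedly across the token and
            # every capture group match becomes its own token.
            for match in re.finditer(pattern, token):
                tokens.add(match.group(1).lower())
    return tokens


# Toy "index": document id -> set of indexed tokens
INDEX = {doc_id: analyze(text) for doc_id, text in DOCS.items()}


def term_search(term):
    # A term query does not analyze its input; it is compared as-is
    # against the indexed tokens.
    return sorted(doc_id for doc_id, tokens in INDEX.items() if term in tokens)


if __name__ == "__main__":
    print(sorted(analyze("john.doe@gmail.com")))
    for term in ["john.doe@gmail.com", "john.doe", "john", "outlook.com"]:
        print(term, "->", term_search(term))
```

Because the term query skips analysis, the search input has to be lowercased by the caller; the indexed tokens are all lowercase after the lowercase filter.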
Andrei Stefan
  • Great! I have never tried this pattern capture token filter. Could you briefly explain how many tokens are generated for each field, and whether there is a strategy for finding the right combinations for different scenarios? – LYu May 08 '15 at 16:58
  • Just walked through the documentation; I really can't understand why I didn't find it myself. Thanks again, no more explanation is needed. In case someone needs it: [http://www.elastic.co/guide/en/elasticsearch/reference/1.5/analysis-pattern-capture-tokenfilter.html](http://www.elastic.co/guide/en/elasticsearch/reference/1.5/analysis-pattern-capture-tokenfilter.html) – LYu May 08 '15 at 17:02
  • @Andrei, I am writing a Java application. How can I use the above mapping in my application? – TeamZ Jan 02 '20 at 12:10
  • When I use a simple_query_string query, it does not work for me. It only returns results that contain abc@xyz.com. The query: "simple_query_string" : { "query" : "abc.lmp@xyz.com", "fields" : [ "emailAddress^1.0" ], "flags" : -1, "default_operator" : "or", "lenient" : false, "analyze_wildcard" : true, "boost" : 1.0 } – Mr bond Apr 15 '20 at 02:14