
I need to search contacts by email. According to the ES documentation, the best way to do that is to use the uax_url_email tokenizer. Here are my index settings:

settings: {
  index: {
    creation_date: "1467895098804",
    analysis: {
      analyzer: {
        email: {
          type: "custom",
          tokenizer: "uax_url_email"
        }
      }
    },
    number_of_shards: "5",
    number_of_replicas: "1",
    uuid: "wL0P6OIaQqqYpFDvIHArTw",
    version: {
      created: "2030399"
    }
  }
}

and mapping:

contact: {
  dynamic: "false",
  properties: {
    contact_status: {
      type: "string"
    },
    created_at: {
      type: "date",
      format: "strict_date_optional_time||epoch_millis"
    },
    email: {
      type: "string"
    },
    id: {
      type: "long"
    },
    mailing_ids: {
      type: "long"
    },
    subscription_status: {
      type: "string"
    },
    type_ids: {
      type: "long"
    },
    updated_at: {
      type: "date",
      format: "strict_date_optional_time||epoch_millis"
    },
    user_id: {
      type: "long"
    }
  }
}

After creating the index I inserted two documents:

curl -X PUT 'localhost:9200/contacts/contact/1' -d '{"contact_status": "confirmed", "email": "example@gmail.com", "id": "1", "user_id": "1", "subscription_status": "on"}'

and

curl -X PUT 'localhost:9200/contacts/contact/2' -d '{"contact_status": "confirmed", "email": "example@yahoo.com", "id": "2", "user_id": "2", "subscription_status": "on"}'

Then I tried searching contacts by email in different ways:

curl -X POST 'localhost:9200/contacts/_search?pretty' -d '{"query": {"bool": {"must": [ {"match": {"_all": { "query": "example@google.com", "analyzer": "email" } } } ] } } }'

I expected to get 1 result with id=1, but got empty hits:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

The next search query I tested was:

curl -X POST 'localhost:9200/contacts/_search?pretty' -d '{"query": {"bool": {"must": [ {"match": {"_all": { "query": "example@google", "analyzer": "email" } } } ] } } }'

which returned 2 results:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.016878016,
    "hits" : [ {
      "_index" : "contacts",
      "_type" : "contact",
      "_id" : "2",
      "_score" : 0.016878016,
      "_source" : {
        "contact_status" : "confirmed",
        "email" : "example@yahoo.com",
        "id" : "2",
        "user_id" : "2",
        "subscription_status" : "on"
      }
    }, {
      "_index" : "contacts",
      "_type" : "contact",
      "_id" : "1",
      "_score" : 0.016878016,
      "_source" : {
        "contact_status" : "confirmed",
        "email" : "example@gmail.com",
        "id" : "1",
        "user_id" : "1",
        "subscription_status" : "on"
      }
    } ]
  }
}

But I expected to get only 1 document in the search results. What am I doing wrong?

Hroft
  • If `email` only contains the email address why don't you make that field `"index":"not_analyzed"` and then use a `term` filter to search for the email address? – Andrei Stefan Jul 07 '16 at 13:28
  • Because I also need to search by user_id, id and other fields. Moreover, I want to search by part of the email, like this: type `example` in the input and get the list of emails that contain 'example' (in my case, both documents). Or if I type `gmail.com`, get the document with id 1 – Hroft Jul 07 '16 at 13:33
  • I suggest this approach: http://stackoverflow.com/questions/30115867/elasticsearch-analyzer-and-tokenizer-for-emails If you have any difficulties or a different use case than that one let me know. – Andrei Stefan Jul 07 '16 at 14:15

3 Answers


Use this query for your request. It works for me:

GET my_index/_search
{
    "query": {
        "match_phrase_prefix" : {
            "email": "valery@gmail.com"
        }
    }
}

You should get the expected result.

Boston Kenne

This is what happened:

The "uax_url_email" tokenizer behaves like the "standard" tokenizer (meaning it splits on "@"), except when it sees a pattern of the form "<text>@<text>.<text>", in which case it keeps the "@" and emits the whole email address as a single token.

At index time, you defined the "email" field as "string", which defaults to the "standard" analyzer, so your address was split into 2 tokens: "example" and "gmail.com". At search time you specified the "email" analyzer, so your first query, "example@google.com", wasn't split at all (since it matches the email pattern) and therefore matched neither "example" nor "gmail.com" (and the same goes for yahoo).

In your second query you searched for "example@google". This doesn't match the full email pattern, so the email tokenizer behaved like the "standard" tokenizer: it split on the "@" and produced the tokens "example" and "google", looking for either one in your index. Since "example" is indexed in both of your documents, both match.

If you want to be able to match only the "example" part of your address, you can't use your "email" analyzer at search time. In any case, most of the time the search analyzer should be the same as the index analyzer.

Note that the "standard" analyzer won't split "gmail.com" into 2 tokens.
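To make the mismatch concrete, here is a minimal Python sketch that imitates the two tokenizers with regular expressions. The regexes and the helper names (`standard_tokens`, `uax_url_email_tokens`, `search`) are assumptions made for illustration; they only approximate what Elasticsearch does, and the toy index holds just the email values rather than the whole `_all` field.

```python
import re

# Rough stand-ins for the two tokenizers (assumption: these regexes only
# approximate the real Unicode segmentation rules used by Elasticsearch).
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
WORD = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"


def standard_tokens(text):
    """Approximate the standard analyzer: split on '@', keep dots between
    letters, and lowercase every token."""
    return [t.lower() for t in re.findall(WORD, text)]


def uax_url_email_tokens(text):
    """Approximate uax_url_email: a complete address stays one token,
    anything else falls back to word splitting."""
    return re.findall(f"{EMAIL}|{WORD}", text)


# Index time: the "email" field had no analyzer, so the standard one applied.
indexed = {
    1: standard_tokens("example@gmail.com"),
    2: standard_tokens("example@yahoo.com"),
}


def search(query):
    """A match query ORs its terms: a document matches if any term is indexed."""
    terms = set(uax_url_email_tokens(query))
    return sorted(doc_id for doc_id, tokens in indexed.items() if terms & set(tokens))


print(indexed[1])                    # ['example', 'gmail.com']
print(search("example@google.com"))  # []
print(search("example@google"))      # [1, 2]
```

The full address is a single query term that exists nowhere in the index, while the partial address degrades to the term "example", which both documents contain.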

israelst
  • Thanks for your explanation. I thought specifying an analyzer in the search query applied it to the indexed properties rather than to the search query itself. :( So as I understand it, I should specify the analyzer for email at the 'create index' stage. The problem is that I already have this index scheme in production with 600k+ documents in it. Is recreating the index the only way to fix this search issue? – Hroft Jul 08 '16 at 06:23
  • You should specify your "email" analyzer in the mapping. That way, every new document will be analyzed with it. To do that, you only need to create a new index with the new mapping and reindex your old index into the new one. It's very easy. And again, you will only be able to match the email as a whole! – israelst Jul 08 '16 at 12:27
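As a rough sketch of what the comment above describes, the following Python snippet builds the two request bodies involved: one that creates a new index whose "email" field names the analyzer in the mapping, and one that copies the documents over. The target index name `contacts_v2` is hypothetical, only the "email" property is spelled out, and the second body assumes an Elasticsearch version that ships the `_reindex` API (2.3 or later).

```python
import json

# Index settings: the custom analyzer from the question, unchanged.
settings = {
    "analysis": {
        "analyzer": {
            "email": {"type": "custom", "tokenizer": "uax_url_email"}
        }
    }
}

# Only the "email" property changes: it now names the analyzer explicitly.
mapping = {
    "contact": {
        "dynamic": "false",
        "properties": {
            "email": {"type": "string", "analyzer": "email"},
            # The remaining properties would be copied from the original mapping.
        },
    }
}

# Body for creating the new index, e.g. PUT /contacts_v2 (hypothetical name).
create_index_body = {"settings": settings, "mappings": mapping}

# Body for POST /_reindex (assumes the _reindex API is available).
reindex_body = {
    "source": {"index": "contacts"},
    "dest": {"index": "contacts_v2"},
}

print(json.dumps(create_index_body, indent=2))
print(json.dumps(reindex_body))
```

With a mapping like this, searches would target the `email` field directly, since the `_all` field is analyzed separately.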

I used a regexp query:

{
  "query": {
    "regexp": {
      "email": {
        "value": "example@gmail.com",
        "flags": "NONE"
      }
    }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/regexp-syntax.html

John Bench