
I am currently using the FOSElasticaBundle in Symfony2 and I am having a hard time building a search that matches the longest prefix.

I am aware of the 100 examples on the Internet for performing autocomplete-like searches with it. However, my problem is a little different.

In an autocomplete-type search the database holds the longer alphanumeric strings and the user provides just a short portion: say the user types "jho" and Elasticsearch can easily suggest "Jhon, Jhonny, Jhonas".

My problem is the reverse: I want to provide the long alphanumeric string and have Elasticsearch return the longest matching prefix stored in the database.

For example: I could provide "123456789" while my database holds [12,123,14,156,16,7,1234,1,67,8,9,123456,0]; in this case the longest prefix match in the database for the number the user provided is "123456".

I am just starting with Elasticsearch, so I don't really have close-to-working settings yet.

If any information is unclear or missing, let me know and I will provide more details.

Update 1 (Using Val's 2nd Update)

Index: Download 1800+ indexes

Settings:

curl -XPUT localhost:9200/tests -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "my_string": {
          "type": "string",
          "fields": {
            "prefix": {
              "type": "string",
              "analyzer": "edge_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}'


Query:

curl -XPOST localhost:9200/tests/test/_search?pretty=true -d '{
  "size": 1,
  "sort": {
    "_script": {
      "script": "doc.my_string.value.length()",
      "type": "number",
      "order": "desc"
    },
    "_score": "desc" 
  },
  "query": {
    "filtered": {
      "query": {
        "match": {
          "my_string.prefix": "8092232423"
        }
      },
      "filter": {
        "script": {
          "script": "doc.my_string.value.length() <= maxlength",
          "params": {
            "maxlength": 10
          }
        }
      }
    }
  }
}'

With this configuration the query returns the following result, which is not the longest matching prefix I expected (see the discussion in the comments below):

{
  "took" : 61,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1754,
    "max_score" : null,
    "hits" : [ {
      "_index" : "tests",
      "_type" : "test",
      "_id" : "AU8LqQo4FbTZPxBtq3-Q",
      "_score" : 0.13441172,
      "_source":{"my_string":"80928870"},
      "sort" : [ 8.0, 0.13441172 ]
    } ]
  }
}

Bonus question

I would like to provide an array of numbers to this search and get the longest matching prefix for each of them efficiently, without having to run a separate query for each number.

  • We do a similar search in MySQL; the technique we use is to query all the matches and then calculate the largest in PHP. So search for all items matching [1,12,123,1234,12345,123456,1234567,12345678,123456789], then for each result check if it's the longest string. I'm also interested in a better solution. Is your elasticsearch data coming from a KeyValue or Object DB? – Jason Hendry Aug 07 '15 at 01:25
  • I am currently doing it like that: I search for all the results, prepare a binary tree and then search for the longest prefix, but I'm worried about memory management and that stuff... at the moment there are 300,000 entries in that list. Also, how are you querying only the matching results? As for your question, it can be either way – Benjamin Vison Aug 07 '15 at 02:59
  • Quick question: do you want the response to only include matches that are **shorter** than the input string or longer matches are also fine? i.e. if I input "123456789" and the database contains "123456789abcd" would that also qualify as a match or not? – Val Aug 07 '15 at 04:24
  • Yes, that would qualify as a match, anything that starts with that prefix (the longest prefix) is a valid match – Benjamin Vison Aug 07 '15 at 04:26

1 Answer


Here is my take on it.

Basically, what we need to do is to slice and dice the field (called my_string below) at indexing time with an edgeNGram tokenizer (called edge_ngram_tokenizer below). That way a string like 123456789 will be tokenized to 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789 and all tokens will be indexed and searchable.

So let's create a tests index, a custom analyzer called edge_ngram_analyzer, and a test mapping containing a single string field called my_string. You'll note that the my_string field is a multi-field declaring a prefixes sub-field which will contain all the tokenized prefixes.

curl -XPUT localhost:9200/tests -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "my_string": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "index_analyzer": "edge_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}'
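
As a quick sanity check, you can verify which prefixes edge_ngram_analyzer actually emits with the _analyze API (an optional verification step, shown here with the ES 1.x query-string syntax and the tests index created above):

# ask the index to analyze a sample input with the custom analyzer
curl -XGET 'localhost:9200/tests/_analyze?analyzer=edge_ngram_analyzer&text=123456789&pretty'

The returned tokens should be 12, 123, 1234, ... all the way up to 123456789, i.e. exactly the prefixes described above.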

Then let's index a few test documents using the _bulk API:

curl -XPOST localhost:9200/tests/test/_bulk -d '
{"index":{}}
{"my_string":"12"}
{"index":{}}
{"my_string":"1234"}
{"index":{}}
{"my_string":"1234567890"}
{"index":{}}
{"my_string":"abcd"}
{"index":{}}
{"my_string":"abcdefgh"}
{"index":{}}
{"my_string":"123456789abcd"}
{"index":{}}
{"my_string":"abcd123456789"}
'

The thing I found particularly tricky is that the matching result can be either longer or shorter than the input string. To handle that, we have to combine two queries: one looking for shorter matches and another for longer matches. The match query will find documents with shorter "prefixes" matching the input, and the query_string query (with the edge_ngram_analyzer applied to the input string!) will search for "prefixes" longer than the input string. Both enclosed in a bool/should and sorted by decreasing string length (i.e. longest first) will do the trick.

Let's do some queries and see what unfolds:

This query will return the one document with the longest match for "123456789", i.e. "123456789abcd". In this case, the result is longer than the input.

curl -XPOST localhost:9200/tests/test/_search -d '{
  "size": 1,
  "sort": {
    "_script": {
      "script": "doc.my_string.value.length()",
      "type": "number",
      "order": "desc"
    }
  },
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_string.prefixes": "123456789"
          }
        },
        {
          "query_string": {
            "query": "123456789",
            "default_field": "my_string.prefixes",
            "analyzer": "edge_ngram_analyzer"
          }
        }
      ]
    }
  }
}'

The second query will return the one document with the longest match for "123456789abcdef", i.e. "123456789abcd". In this case, the result is shorter than the input.

curl -XPOST localhost:9200/tests/test/_search -d '{
  "size": 1,
  "sort": {
    "_script": {
      "script": "doc.my_string.value.length()",
      "type": "number",
      "order": "desc"
    }
  },
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_string.prefixes": "123456789abcdef"
          }
        },
        {
          "query_string": {
            "query": "123456789abcdef",
            "default_field": "my_string.prefixes",
            "analyzer": "edge_ngram_analyzer"
          }
        }
      ]
    }
  }
}'

I hope that covers it. Let me know if not.

As for your bonus question, I'd simply suggest using the _msearch API and sending all queries at once.
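
For illustration, a minimal sketch of such a multi search call, simply reusing the two queries above (one header/body pair per input string, each body on a single line as the _msearch format requires):

# two searches in a single round trip; the empty {} headers reuse the index/type from the URL
curl -XPOST localhost:9200/tests/test/_msearch -d '
{}
{"size":1,"sort":{"_script":{"script":"doc.my_string.value.length()","type":"number","order":"desc"}},"query":{"bool":{"should":[{"match":{"my_string.prefixes":"123456789"}},{"query_string":{"query":"123456789","default_field":"my_string.prefixes","analyzer":"edge_ngram_analyzer"}}]}}}
{}
{"size":1,"sort":{"_script":{"script":"doc.my_string.value.length()","type":"number","order":"desc"}},"query":{"bool":{"should":[{"match":{"my_string.prefixes":"123456789abcdef"}},{"query_string":{"query":"123456789abcdef","default_field":"my_string.prefixes","analyzer":"edge_ngram_analyzer"}}]}}}
'

The response contains a responses array with one search result per query, in the same order as the requests, so each input can be mapped back to its longest matching prefix.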

UPDATE: Finally, make sure that scripting is enabled in your elasticsearch.yml file using the following:

 # if you have ES <1.6
 script.disable_dynamic: false

 # if you have ES >=1.6
 script.inline: on

UPDATE 2 I'm leaving the above in place, as the use case might fit someone else's needs. Now, since you only need "shorter" prefixes (makes sense!!), we need to change the mapping and the query a little bit.

The mapping would be like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [
            "lowercase"
          ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "my_string": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "analyzer": "edge_ngram_analyzer"  <--- only change
            }
          }
        }
      }
    }
  }
}
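
Since the analyzer of an existing field cannot be changed in place, applying this means dropping and re-creating the index and then re-indexing your documents. A runnable version of the above (same tests index name and settings as before, only the analyzer line differs) could look like this:

# delete the existing index first (careful: this removes its documents)
curl -XDELETE localhost:9200/tests

# re-create it with the updated mapping
curl -XPUT localhost:9200/tests -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "my_string": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "analyzer": "edge_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}'

After that, re-index your documents (e.g. with the same _bulk call as earlier).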

And the query is now a bit different, but it will always return only the longest prefix that is shorter than or equal in length to the input string. Please try it out. I advise re-indexing your data to make sure everything is set up properly.

{
  "size": 1,
  "sort": {
    "_script": {
      "script": "doc.my_string.value.length()",
      "type": "number",
      "order": "desc"
    },
    "_score": "desc"           <----- also add this line
  },
  "query": {
    "filtered": {
      "query": {
        "match": {
          "my_string.prefixes": "123"  <--- input string
        }
      },
      "filter": {
        "script": {
          "script": "doc.my_string.value.length() <= maxlength",
          "params": {
            "maxlength": 3      <---- this needs to be set to the length of the input string
          }
        }
      }
    }
  }
}
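
For example, with the input "2423579999" mentioned in the comments below, maxlength would be 10 and the full request against the tests index would look like this (a concrete instance of the query above, nothing else changed):

curl -XPOST localhost:9200/tests/test/_search?pretty=true -d '{
  "size": 1,
  "sort": {
    "_script": {
      "script": "doc.my_string.value.length()",
      "type": "number",
      "order": "desc"
    },
    "_score": "desc"
  },
  "query": {
    "filtered": {
      "query": {
        "match": {
          "my_string.prefixes": "2423579999"
        }
      },
      "filter": {
        "script": {
          "script": "doc.my_string.value.length() <= maxlength",
          "params": {
            "maxlength": 10
          }
        }
      }
    }
  }
}'

Per the discussion in the comments, with the _score sort in place this returns 242357 as the longest stored prefix of "2423579999".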
  • any idea why, when querying "123456789", it's returning all the hits, including abcd for example? I am using your samples – Benjamin Vison Aug 07 '15 at 14:04
  • You mean the first query above? For me, if I remove the size parameter, it's only returning four hits, namely `123456789abcd`, `1234567890`, `1234`, `12`. Are you sure about the way you're querying (i.e. not using GET inside the `/head/` plugin for instance)? – Val Aug 07 '15 at 14:09
  • Never mind, I was using cURL incorrectly in PHP. I am using plain curl just to test before I move on to Elastica, since I want to confirm this works. This is the error that I'm getting: [link](https://www.sendspace.com/file/82ypnl). Apparently it's a parsing error, but since I'm new to this I don't really know where to debug in the string – Benjamin Vison Aug 07 '15 at 14:31
  • My bad actually, I forgot to mention that you need to enable groovy scripting. Open your `elasticsearch.yml` file, add `script.disable_dynamic: false` towards the end and restart your cluster. – Val Aug 07 '15 at 14:35
  • Awesome stuff, I am testing it in the terminal and so far it is working properly, with one minor thing: I added around 2,000 new prefixes and there are a few prefixes like 242, 2423, 242357, 242359. Would it be too difficult to make it so that if I search for '242' only the 242 in the database appears, i.e. when the prefix is the same as the input? It's not showing up in the results – Benjamin Vison Aug 07 '15 at 16:27
  • So, to sum it up, if there's an exact match we show it, otherwise we show the longest matching prefix? – Val Aug 07 '15 at 16:35
  • mm, to rephrase my comment: if the provided string is shorter (i.e. 242), then I don't think that 2423 or 242357 or 242359 should be valid prefixes, because since we are looking for a 'prefix' of the given string, those wouldn't be valid 'prefixes' or exact matches. – Benjamin Vison Aug 07 '15 at 16:38
  • Haha! Hence my question 12 hours ago whether longer "prefixes" were deemed valid or not :-) – Val Aug 07 '15 at 16:43
  • Omfg, you are right... I am sorry lol, I was almost sleeping when I replied... I didn't read the question properly _-_ – Benjamin Vison Aug 07 '15 at 17:01
  • hmm, the changes you made seem very reasonable, but I am getting empty results for any input string. I even created a new index /tests2 and tried everything (settings, reindex all docs, query using this one) and still got empty results. I don't know if I'm missing something; should I post my settings, query and some data samples? – Benjamin Vison Aug 07 '15 at 19:19
  • Thanks, I'll have a look at it. – Val Aug 08 '15 at 04:59
  • Just making sure: in your gist you create the index after indexing data? or is it just a bad copy/paste order? – Val Aug 08 '15 at 05:05
  • Bad copy/paste; settings > index data > query is what I do – Benjamin Vison Aug 08 '15 at 05:06
  • Oh!! Actually, you have a typo in your query, the sub-field is called `prefix` in your settings and `prefixes` in your query. That should do it. – Val Aug 08 '15 at 05:12
  • You are correct again, Sir! I fixed the typo and now I am getting some output, but still not always the correct one; see above, I changed the query/output a little bit. When I search for "2423579999" it returns 242376 even though 242357 is available in the indexes. Is this happening on your end as well? – Benjamin Vison Aug 08 '15 at 05:23
  • Please check my updated answer, I've complemented the `sort` with `_score: desc`. Also make sure that your `maxlength` parameter always reflects the length of the input (it didn't in your last update, i.e. `maxlength = 3` while the input is 10 chars long). Let me know. – Val Aug 08 '15 at 06:16
  • I updated the above; with the sort changes it actually showed the 242357, but after I tried with another one (809223) it returned something else. I also provided the indexing that I am using to test, so feel free to attempt a few – Benjamin Vison Aug 08 '15 at 14:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/85511/discussion-between-val-and-benjamin-vison). – Val Aug 08 '15 at 14:37