NGram in-order search

Question

A few months ago I asked a similar question here. However, I still cannot get it to work properly:

I am trying to build a simple filename search in which the user can search for any part of the filename.
Let's say the following filenames are indexed:

[1] My_file_2012.01.12.txt
[2] My_file_2012.01.05.txt
[3] My_file_2012.05.01.txt
[4] My_file_2012.08.27.txt
[5] My_file_2012.12.12.txt
[6] My_file_2011.12.12.txt
[7] file_01_2012.09.09.txt

Then the user might search for:

"ile_20"                    (finds the first six documents)
"12.txt"                    (finds 1, 5, 6)
"12" followed by "01"       (finds 1, 2, 3 - NOT 7)
"2012" followed by "01"     (finds 1, 2, 3 - NOT 7)

(Note: Yes, the user might really search for strings like "ile_20" ... e.g. because of copy-and-paste mistakes)
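
To pin down the expected behaviour, here is a minimal sketch in plain Python (not an index query) of what "followed by" is meant to do: each term must occur as a substring, and each one must start after the previous match ends. The names `FILENAMES`, `matches_in_order` and `search` are made up for this illustration, and the "max. 100 characters" gap limit is not modelled.

```python
# The seven example filenames, numbered 1..7 as in the listing above.
FILENAMES = [
    "My_file_2012.01.12.txt",
    "My_file_2012.01.05.txt",
    "My_file_2012.05.01.txt",
    "My_file_2012.08.27.txt",
    "My_file_2012.12.12.txt",
    "My_file_2011.12.12.txt",
    "file_01_2012.09.09.txt",
]


def matches_in_order(name, terms):
    """Return True if every term occurs in name, left to right, without overlap."""
    pos = 0
    for term in terms:
        found = name.find(term, pos)  # leftmost match at or after pos
        if found == -1:
            return False
        pos = found + len(term)  # the next term must start after this match
    return True


def search(terms):
    """Return the 1-based numbers of the filenames that match all terms in order."""
    return [
        number
        for number, name in enumerate(FILENAMES, start=1)
        if matches_in_order(name, terms)
    ]


print(search(["ile_20"]))
print(search(["12.txt"]))
print(search(["12", "01"]))
print(search(["2012", "01"]))
```

Document 7 is excluded from the last two searches because its only "01" comes before "2012", not after it.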

Therefore I use an nGram tokenizer to index each part of the filename. This works fine so far. To support the "followed by" search mentioned above, I need a query that respects the order of the terms, no matter how much text lies between the two terms (okay, let's say max. 100 characters).

Since a "text_phrase"-query with a "slop" does not respect the ordering of the terms correctly, I decided to use a "span_near" query. This works fine in most cases.

See here my full example-index incl. error-description: click

As mentioned in the example above, the query "'2012' followed by '01'" does not work: the nGram tokenizer generates a position value for each token, but these values are not very useful for the "span_near" query. While indexing, the term "2012" is assigned a position value (e.g. 50) that is bigger than the position value of the term "01" (e.g. 10). Since 50 and 10 are not in order, the query returns no results. The in-order matching only works correctly for terms that have the same length (e.g. "'12' followed by '01'") or if the terms are ordered by length (e.g. "'20' followed by '.12'").
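
The effect can be reproduced with a small simulation. This is only a sketch under an assumption: it supposes that the tokenizer emits all grams of the shortest length first, then the next length, and so on, giving each token the next position. That emission order is inferred from the behaviour described above, not taken from the tokenizer's source, and `ngram_tokens` is a made-up helper.

```python
def ngram_tokens(text, min_gram, max_gram):
    """Simulate an n-gram tokenizer that emits grams grouped by length.

    ASSUMPTION: shortest grams first, one position per token.
    Returns a list of (gram, position, start_offset) tuples.
    """
    tokens = []
    position = 0
    for size in range(min_gram, max_gram + 1):
        for start in range(len(text) - size + 1):
            tokens.append((text[start:start + size], position, start))
            position += 1
    return tokens


tokens = ngram_tokens("My_file_2012.01.12.txt", 2, 4)

# Every (position, start_offset) pair at which each term was indexed.
hits_2012 = [(pos, off) for gram, pos, off in tokens if gram == "2012"]
hits_01 = [(pos, off) for gram, pos, off in tokens if gram == "01"]

print("2012:", hits_2012)
print("01:  ", hits_01)
```

Under this assumption every 2-gram gets a smaller position than any 4-gram, so "01" always precedes "2012" by position, even though the start offsets show "2012" beginning before the second "01". An in-order position check therefore rejects a document that should match.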

So how can I achieve the correct search-behaviour? I just want the ability to search for any part(s) of the filename while respecting the order of the terms.
Maybe there is a way to tell "span_near" to not use the position but instead the "start_offset"? Or is there another query I can use?

score 0 · Accepted Answer · answered Sep 06 '12 at 01:08

How about a wildcard search like this:

"12" followed by "01" -> 12*01
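
A rough sketch of how such a pattern could be built and sent as a wildcard query. The field name `filename` is hypothetical, and the sketch assumes the pattern is matched against the whole filename as one term, which is why it adds a leading and trailing `*`; against n-gram terms the shorter `12*01` form shown above may be enough. `fnmatchcase` is used only to emulate the matching locally, and the terms are assumed not to contain wildcard characters themselves.

```python
from fnmatch import fnmatchcase


def wildcard_pattern(terms):
    """Join the terms with '*' and allow arbitrary text before and after."""
    return "*" + "*".join(terms) + "*"


def wildcard_query(field, terms):
    """Build a wildcard query body as a plain dict (the field name is an assumption)."""
    return {"query": {"wildcard": {field: wildcard_pattern(terms)}}}


pattern = wildcard_pattern(["12", "01"])
print(wildcard_query("filename", ["12", "01"]))

# Local emulation of the match, independent of any search server.
print(fnmatchcase("My_file_2012.01.12.txt", pattern))
print(fnmatchcase("file_01_2012.09.09.txt", pattern))
```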

MD Luffy

Yes, this is what I have been doing since yesterday. It works because, thanks to the NGram tokenizer, each possible search term is indexed. However, I wonder whether this can cause performance issues. I have already sped up the search drastically by using an edgeNGram. – Biggie Sep 06 '12 at 10:26
There is a limited way to do that; for example, you can do it only on dates. In plain English, it would be "my substring starts with A and ends with B". I'm speaking in terms of Solr, so translate as appropriate. 1. Copy to a new field, let's call it FieldFront. 2. Use a regex and retain only the portion that you are interested in (e.g. [0-9\.]+ would match a contiguous substring of digits or dots). 3. Apply an edge n-gram on the left. Repeat 1-3 with a new copy field FieldRev, except that in step 3 you apply the edge n-gram from the right. Then when you run your query you can say something like A:12 AND B:01 – MD Luffy Sep 14 '12 at 01:04
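
The three steps in the comment above could be emulated roughly as follows. This is a plain Python illustration, not Solr configuration: `numeric_run`, `front_grams`, `back_grams` and `starts_and_ends_with` are made-up helpers, and only the first run matched by the regex is kept, which is an assumption.

```python
import re


def numeric_run(name):
    """Step 2: keep only the first contiguous run of digits and dots."""
    match = re.search(r"[0-9.]+", name)
    return match.group(0) if match else ""


def front_grams(text):
    """Step 3 for FieldFront: edge n-grams anchored at the left end."""
    return {text[:size] for size in range(1, len(text) + 1)}


def back_grams(text):
    """Step 3 for FieldRev: edge n-grams anchored at the right end."""
    return {text[-size:] for size in range(1, len(text) + 1)}


def starts_and_ends_with(name, start, end):
    """Emulate a query of the form FieldFront:start AND FieldRev:end."""
    run = numeric_run(name)
    return start in front_grams(run) and end in back_grams(run)


print(numeric_run("My_file_2012.01.12.txt"))
print(starts_and_ends_with("My_file_2012.05.01.txt", "2012", "01."))
print(starts_and_ends_with("file_01_2012.09.09.txt", "2012", "09."))
```

Note the limits of this scheme: the retained run keeps the dot before "txt", so the "ends with" term has to include it, and a term that appears only in the middle of the run cannot be matched at all.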
