Understanding Lucene leading wildcard performance

Question

Lucene does not by default allow leading wildcards in search terms, but this can be enabled with:

QueryParser#setAllowLeadingWildcard(true)

I understand that use of a leading wildcard prevents Lucene from using the index. Searches with a leading wildcard must scan the entire index.

How do I demonstrate the performance of a leading wildcard query? When is it OK to use setAllowLeadingWildcard(true)?

I have built a test index with 10 million documents in the form:

{ name: random_3_word_phrase }

The index is 360M on disk.

My test queries perform well and I have been unable to actually demonstrate a performance problem. For example, querying for name:*ing produces over 1.1 million documents in less than 1 second. Querying name:*ing* produces over 1.5 million documents in the same time.

What is going here? Why isn't this slow? Is 10,000,000 documents not enough? Do the documents need to contains more than just a single field?

You need a bigger index. Try downloading the wikipedia data set and indexing that, then retest... — GalacticJello, Aug 01 '12 at 20:18
@GalacticJello bigger as in more documents? More fields per document? Longer fields? — Landon Kuhn, Aug 01 '12 at 20:33
You didn't mention which version of lucene. Maybe you already know, but the 4.0 (currently alpha) version contains some improvements which make wildcard searches a lot faster. — javanna, Aug 02 '12 at 08:18

GalacticJello · Answer 1 · 2012-08-01T20:38:41.730

Depends on how much memory you have, and how much of the token index is in memory.

A 360MB total index could be searched quite quickly on any old computer. A 360GB index would take a bit longer... ;)

As an example, I fired up an old 2GB index, and searched for "*e".

On a box with 8GB, it returned 500K hits in under 5 seconds. I tried the same index on a box with only 1GB of memory, and it took about 20 seconds.

To illustrate further, here's some generic C# code that basically does a "** E*" type search of 10 million random 3 word phrases.

static string substring = "E";

private static Random random = new Random((int)DateTime.Now.Ticks);//thanks to McAden

private static string RandomString(int size)
{
    StringBuilder builder = new StringBuilder();
    char ch;
    for (int i = 0; i < size; i++)
    {
        ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
        builder.Append(ch);
    }

    return builder.ToString();
}

static void FindSubStringInPhrases()
{
    List<string> index = new List<string>();

    for (int i = 0; i < 10000000; i++)
    {
        index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
    }

    var matches = index.FindAll(SubstringPredicate);

}

static bool SubstringPredicate(string item)
{
    if (item.Contains(substring))
        return true;
    else
        return false;
}

After all 10 million phases have been loaded into the list, it still only takes about a second for "var matches = index.FindAll(SubstringPredicate);" to return over 4 million hits.

The point is, memory is fast. Once things can no longer fit into memory and you have to start swapping to disk is when you are going to see performance hits.

score 2 · Answer 2 · answered Aug 01 '12 at 19:55

If I understand it correctly, a part of the index is the term dictionary, which is a sorted list of all the indexed terms. When searching with no wildcards or a trailing wildcard, Lucene can take advantage of the fact that many terms have common prefixes. On the other hand, searching with a leading wildcard scans the entire term dictionary. This is not optimal, but the term dictionary tends to be tiny compared to other parts of the index, such as frequency and position data, so a full scan of the term dictionary is usually not big problem by itself.

Understanding Lucene leading wildcard performance

2 Answers2

Linked