
I am trying to figure out how to filter a very large set of documents based on keyword matching.

I have 20+ million entries with an ID and (several) text fields in my SQL database, and I want to get all IDs for which the text matches a set of keywords. This includes more complex expressions like:

 (term1 NEAR term2 NEAR term3) AND NOT "A phrase" AND @fieldXYZ "wildcards%aswell*"

The results do not need to be scored, sorted or ranked in any way.

From what I understand, the power of Lucene/Solr, Sphinx and Elasticsearch is to give back the TOP documents super fast, but they are not really intended to give back ALL documents.

I know that it is possible to do this with a custom Collector in Lucene (see "What's the most efficient way to retrieve all matching documents from a query in Lucene, unsorted?") and possibly with cursors/scrolling in Solr/Elasticsearch, but I am wondering: is there any other technology that is specifically optimized for this problem?

dennis
  • Can't really help with the actual question, but while Sphinx doesn't have explicit 'cursor' support, it's pretty easy to set up with the available options. I've been able to fetch on the order of millions of IDs in 2-3 seconds for a complex query. – barryhunter Jan 16 '15 at 18:39
  • Could you maybe explain a bit more how you get this performance? The tests I've done with Sphinx, simply re-querying with a raised LIMIT offset, are not really satisfactory: it looks like it is doing a new search every time, although I stayed under the value for MAX_MATCHES. Shouldn't it keep the results in memory then? Also, if I increase MAX_MATCHES to a value that allows me to get all matching documents, the queries get fairly slow, up to 10s (on a very basic and unoptimized index). – dennis Jan 19 '15 at 14:32
  • Well, the idea is NOT to do it with LIMIT. As you say, it's horribly inefficient. You need to implement an actual cursor system, using a filter rather than offset/limit. Example here: http://sphinxsearch.com/forum/view.html?id=7215 - but you can squeeze more performance out using SphinxQL instead, and careful use of the CUTOFF option. (You can also use a sharded distributed index to take advantage of modern multi-core processors.) (A SphinxQL sketch of this cursor pattern follows these comments.) – barryhunter Jan 19 '15 at 15:55
  • Ok, thanks! I am using SphinxQL anyway, so I will see how I am able to implement it and how things will end up working. – dennis Jan 19 '15 at 16:17
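
For reference, the filter-based cursor barryhunter describes looks roughly like this in SphinxQL (a sketch; myindex and the query terms are placeholders, and each pass filters on the last document ID seen instead of paging with OFFSET):

SELECT id FROM myindex WHERE MATCH('term1 term2') AND id > 0 ORDER BY id ASC LIMIT 1000;

Note the largest id returned, substitute it into the id > filter, and repeat until no rows come back.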

2 Answers

From what I understand, the power of Lucene/Solr, Sphinx and Elasticsearch is to give back the TOP documents super fast, but they are not really intended to give back ALL documents.

Actually, this used to be true, but things have gotten much better in recent years. I'll defer to others when it comes to other software options, but Lucene did get some improvements early in the 4.x series to do efficient deep pagination with a cursor.
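
That cursor is exposed as the IndexSearcher.searchAfter API. A rough sketch of the pattern (in C#, using Lucene.NET 4.x naming; the searcher and query objects and the page size are assumed):

const int pageSize = 1000;
ScoreDoc lastDoc = null;
while (true)
{
    // First page uses a plain search; later pages resume after the last hit seen
    TopDocs page = (lastDoc == null)
        ? searcher.Search(query, pageSize)
        : searcher.SearchAfter(lastDoc, query, pageSize);

    foreach (ScoreDoc hit in page.ScoreDocs)
    {
        // collect hit.Doc
    }

    if (page.ScoreDocs.Length < pageSize)
        break; // a short page means every hit has been seen
    lastDoc = page.ScoreDocs[page.ScoreDocs.Length - 1];
}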

Elasticsearch has a particularly nice API for this: the Scrolling Search. To use it, you supply your search query along with the scroll parameter. It then returns a scroll_id cursor, which you use to make subsequent requests for each page.

If you don't care about sorting, and just want all of the documents returned, then you can also specify a search type of scan. That will return all the documents in the most efficient ordering, with no particular sorting applied.
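
A scan-and-scroll session looks roughly like this (a sketch against an Elasticsearch 1.x node, the current API at the time of writing; the index name myindex and the page size are placeholders, and note that in scan mode the size applies per shard):

curl -XGET 'http://localhost:9200/myindex/_search?search_type=scan&scroll=1m' -d '{
    "query": { "query_string": { "query": "term1* OR term2*" } },
    "size": 1000
}'

The response carries a scroll_id; pass it back as the request body to page through the hits until a request returns none:

curl -XGET 'http://localhost:9200/_search/scroll?scroll=1m' -d '<scroll_id from the previous response>'

The scroll=1m timeout only needs to outlive the gap between consecutive page requests, not the whole export.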

I've glossed over some of the details here; you'll want to see the Scrolling Search documentation for a more thorough description.

Solr also supports deep pagination as of Solr 4.7 in SOLR-5463. It adds support for a cursorMark parameter to be used alongside your search requests. Solr then returns a nextCursorMark pointing to each subsequent page.

See the "Using Cursors" section in Solr's Pagination of Results documentation.

It sounds like the OP is already familiar with those options, but I thought it was worth fleshing them out for the sake of others with a similar question.

Also interesting: my take on the mechanics and efficiency of an Elasticsearch scrolling search.

Nick Zadrozny
  • Hmm, Scrolling Search seems to pretty much fit the 'specifically optimized for this problem' requirement then :) – barryhunter Jan 16 '15 at 18:18
  • The problem I can see with pagination would be that the index has to be requeried for every new 'page'. At least this is how I understand it. This seems to be pretty inefficient, considering that every matching document has already been visited once, especially with queries that take more than a second. – dennis Jan 19 '15 at 14:36
  • On reading the Scrolling Search docs, it says "The scroll parameter tells Elasticsearch how long it should keep the search context alive." So Elasticsearch will actually maintain state between requests; it's not starting from scratch every time. – barryhunter Jan 19 '15 at 15:45
  • Correct: state is maintained between calls with the Scrolling search. It's designed specifically to be cheap for deep pagination. Traditional search and pagination, on the other hand, is a fresh search every time (cached filters aside). – Nick Zadrozny Jan 19 '15 at 19:22

In case it is helpful to anyone dealing with the same problem, here is the solution I am going with.

I am using Lucene with a custom Collector which stores all matching IDs, without any processing:

using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Search;

class IDCollector : Collector
{
    // Initial capacity for the hit list; sized near the expected
    // number of hits (~30 million here) to avoid repeated resizing
    private const int INITIAL_CAPACITY = 30000000;

    // Doc ID offset of the current index segment reader
    private int docBase;

    // Stores the Lucene doc IDs of all hits
    public List<int> HitList { get; set; }

    public IDCollector()
    {
        this.HitList = new List<int>(INITIAL_CAPACITY);
    }

    public override void Collect(int doc)
    {
        HitList.Add(doc + docBase);
    }

    // Order of docs does not matter
    public override bool AcceptsDocsOutOfOrder { get { return true; } }

    // Called once per segment; docBase is the offset from the reader
    public override void SetNextReader(IndexReader reader, int docBase)
    {
        this.docBase = docBase;
    }

    // Scoring is not necessary, so the scorer is ignored
    public override void SetScorer(Scorer scorer) { }
}
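
This is how the collector is run (a minimal sketch, assuming an IndexSearcher named searcher is already open over the index and query is the parsed query):

var collector = new IDCollector();
searcher.Search(query, collector);
// collector.HitList now holds the internal Lucene doc IDs of every match;
// map them to application IDs via stored fields or the FieldCache if needed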

This way it is possible to collect the IDs of all ~30 million matching documents in about 5.5 seconds for a query like term1* OR term2* OR term3* OR term4*.

Unfortunately, though probably unavoidably, search speed is extremely dependent on the number of hits, even without any scoring, sorting or similar processing of the hits.

dennis