What's the most efficient way to retrieve all matching documents from a query in Lucene, unsorted?

Question

I am looking to perform a query for the purposes of maintaining internal integrity; for example, removing all traces of a particular field/value from the index. Therefore it's important that I find all matching documents (not just the top n docs), but the order they are returned in is irrelevant.

According to the docs, it looks like I need to use the Searcher.Search( Query, Collector ) method, but there's no built-in Collector class that does what I need.

Should I derive my own Collector for this purpose? What do I need to keep in mind when doing that?

Keep this in mind if you want to return ALL results: http://forums.alfresco.com/en/viewtopic.php?t=13381 — Please treat your mods well., Mar 25 '11 at 17:30
@Rodrigo Could you be a bit more specific? I read over that thread but it appears to have to do with permission checks. Can you explain how that is relevant to my question? — devios1, Mar 25 '11 at 20:00

devios1 · Accepted Answer · 2011-03-31T03:23:19.153

Turns out this was a lot easier than I expected. I just used the example implementation at http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/Collector.html and recorded the doc numbers passed to the Collect() method in a List, exposing this as a public Docs property.

I then simply iterate this property, passing the number back to the Searcher to get the proper Document:

var searcher = new IndexSearcher( reader );
var collector = new IntegralCollector(); // my custom Collector
searcher.Search( query, collector );
var result = new Document[ collector.Docs.Count ];
for ( int i = 0; i < collector.Docs.Count; i++ )
    result[ i ] = searcher.Doc( collector.Docs[ i ] );
searcher.Close(); // this is probably not needed
reader.Close();

So far it seems to be working fine in preliminary tests.

Update: Here's the code for IntegralCollector:

internal class IntegralCollector: Lucene.Net.Search.Collector {
    private int _docBase;

    private List<int> _docs = new List<int>();
    public List<int> Docs {
        get { return _docs; }
    }

    public override bool AcceptsDocsOutOfOrder() {
        return true;
    }

    public override void Collect( int doc ) {
        _docs.Add( _docBase + doc );
    }

    public override void SetNextReader( Lucene.Net.Index.IndexReader reader, int docBase ) {
        _docBase = docBase;
    }

    public override void SetScorer( Lucene.Net.Search.Scorer scorer ) {
    }
}

Just remember to use the docBase value passed to your `SetNextReader`, since the document id passed to `Collect` is specific to the current reader (from `SetNextReader`). You'll need to use (docBase+doc) when calculating ids to use with the topmost reader, the one used when opening your `IndexSearcher`. — sisve, Mar 30 '11 at 18:56
Also, don't forget about `IndexWriter.DeleteDocuments(Query)` if you want to remove matching documents. — sisve, Mar 30 '11 at 18:58
@Simon - Thanks, I figured that out myself when I started getting wonky results. Also, deletion was just an example; I actually do need to retrieve the documents in my real application. — devios1, Mar 31 '11 at 03:21
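
For readers who want to see the docBase arithmetic from the comments in isolation, here is a minimal sketch in plain Java that does not use Lucene at all. The class `DocBaseSketch`, the nested `RebasingCollector`, and the segment sizes and hit lists are all hypothetical, made up for illustration. It assumes each segment's docBase is the running total of the sizes of the segments before it, which is how the comment above describes the relationship between `SetNextReader` and `Collect`.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained illustration only; no Lucene classes are used here.
class DocBaseSketch {

    // Hypothetical stand-in for a collector: remembers the current
    // segment's docBase and records index-wide ids.
    static class RebasingCollector {
        private int docBase;
        private final List<Integer> docs = new ArrayList<>();

        // Plays the role of SetNextReader: called once per segment,
        // before any of that segment's hits are delivered.
        void setNextReader(int docBase) {
            this.docBase = docBase;
        }

        // Plays the role of Collect: 'doc' is relative to the current segment.
        void collect(int doc) {
            docs.add(docBase + doc);
        }

        List<Integer> docs() {
            return docs;
        }
    }

    // segmentSizes[i] is the number of doc slots in segment i (assumed);
    // hits[i] holds the segment-local ids of the matches in segment i.
    static List<Integer> collectAll(int[] segmentSizes, int[][] hits) {
        RebasingCollector collector = new RebasingCollector();
        int docBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextReader(docBase);
            for (int doc : hits[seg]) {
                collector.collect(doc);
            }
            docBase += segmentSizes[seg]; // the next segment starts after this one
        }
        return collector.docs();
    }

    public static void main(String[] args) {
        int[] sizes = {5, 3, 4};
        int[][] hits = {{1, 4}, {0}, {2, 3}};
        System.out.println(collectAll(sizes, hits)); // prints [1, 4, 5, 10, 11]
    }
}
```

If `collect` stored `doc` without adding `docBase`, the hits from the second and third segments would collide with ids from the first segment, which is the kind of wonky result mentioned above.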

score 0 · Answer 2 · answered Mar 26 '11 at 00:58

No need to write a hit collector if you're just looking to get all the Document objects in the index. Just loop from 0 to maxDoc() and call reader.document() on each doc id, making sure to skip documents that are already deleted:

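List<Document> results = new ArrayList<Document>();
for (int i = 0; i < reader.maxDoc(); i++) {
   if (reader.isDeleted(i))
      continue; // skip deleted documents
   // append rather than index by i, so skipped ids leave no null gaps
   results.add(reader.document(i));
}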

bajafresh4life


Thanks, but I am interested in actually performing a query, not just getting all the documents in the index. – devios1 Mar 26 '11 at 05:31
