I have documents stored in a Lucene index, each with a docId field, and I want to retrieve every docId in the index. One complication: the index holds about 300,000 documents, so I would prefer to fetch the docIds in chunks of 500. Is that possible?
5 Answers
IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
    // do something with docId here...
}
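The loop above visits every document in one pass. To hand the docIds on in batches of 500, as the question asks, the live document numbers can be grouped as they are encountered. The following is a minimal sketch using only the JDK. DocIdChunker and forEachChunk are hypothetical names, not Lucene API, and the IntPredicate is an assumed stand-in for whatever deleted-document check your Lucene version offers (for example reader.isDeleted(i) in the loop above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntPredicate;

// Hypothetical helper (not part of Lucene): groups live document numbers
// into fixed-size chunks and hands each chunk to a callback.
class DocIdChunker {

    /**
     * @param maxDoc    upper bound on document numbers, as returned by reader.maxDoc()
     * @param isDeleted assumed stand-in for the reader's deleted-document check
     * @param chunkSize maximum number of document numbers per chunk, e.g. 500
     * @param onChunk   called once per chunk; the last chunk may be smaller
     */
    static void forEachChunk(int maxDoc, IntPredicate isDeleted, int chunkSize,
                             Consumer<List<Integer>> onChunk) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Integer> current = new ArrayList<>(chunkSize);
        for (int i = 0; i < maxDoc; i++) {
            if (isDeleted.test(i)) {
                continue; // skip deleted documents, as in the loop above
            }
            current.add(i);
            if (current.size() == chunkSize) {
                onChunk.accept(current);
                current = new ArrayList<>(chunkSize); // fresh list for the next chunk
            }
        }
        if (!current.isEmpty()) {
            onChunk.accept(current); // flush the final, possibly partial chunk
        }
    }
}
```

Inside the callback, each document number in the chunk would be passed to reader.document(i) to read its docId field. Only one chunk is held in memory at a time, which is the point of batching an index of this size.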

- Without the isDeleted() check, you would output IDs for documents that had previously been deleted. – bajafresh4life Feb 25 '10 at 03:34
- To complete the comment above: index changes are committed when the index is reopened, so reader.isDeleted(i) is necessary to guarantee that the documents are valid. – Eugeniu Torica Feb 24 '11 at 11:29
- @Jenea What is the equivalent method in Java for checking whether a document has already been deleted? I am looking for similar functionality; I don't want to consider documents that are already deleted. – Shankar Jun 09 '15 at 12:16
- IndexReader.isDeleted() has been gone since at least 2010 (Git changeset 6a4bfc796fea6ed3474350adb271e06275d22e6a). It is definitely not present in Lucene 4.x. – Vlad Jan 11 '23 at 16:58
Lucene 4
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Document doc = reader.document(i);
}
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html

- This was rolled back by another user, but the original editor was correct: liveDocs can be null. – bcoughlan Nov 01 '13 at 15:24
There is a query class named MatchAllDocsQuery; I think it can be used in this case:
Query query = new MatchAllDocsQuery();
// RESULT_LIMIT is the maximum number of hits to return
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);

Document numbers (or IDs) are consecutive integers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only for the currently open IndexReader. You can check whether a document is deleted with the IndexReader.isDeleted(int documentNumber) method.

If you use .document(i) as in the examples above and skip over deleted documents, be careful when using this method to paginate results. For example, suppose you have a list with 10 documents per page and need the documents for page 6. Your input might be something like offset=60, count=10 (documents 60 through 69).
IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;
    Document doc = reader.document(i);
    String docId = doc.get("docId");
}
You will run into problems with the deleted documents, because you should start not from offset=60 but from offset=60 plus the number of deleted documents that appear before 60.
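One way around the offset problem described above is to count live documents rather than raw document numbers: skip the first offset live documents, then collect the next count. This is a minimal sketch using only the JDK. LiveDocPager and page are hypothetical names, not Lucene API, and the IntPredicate is an assumed stand-in for the reader's deleted-document check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical helper (not part of Lucene): returns one page of live
// document numbers, where offset and count are measured in live documents.
class LiveDocPager {

    static List<Integer> page(int maxDoc, IntPredicate isDeleted, int offset, int count) {
        if (offset < 0 || count < 0) {
            throw new IllegalArgumentException("offset and count must be non-negative");
        }
        List<Integer> page = new ArrayList<>(count);
        int liveSeen = 0; // number of live documents encountered so far
        for (int i = 0; i < maxDoc && page.size() < count; i++) {
            if (isDeleted.test(i)) {
                continue; // deleted documents do not consume an offset slot
            }
            if (liveSeen >= offset) {
                page.add(i);
            }
            liveSeen++;
        }
        return page;
    }
}
```

Each call scans from document 0, so the cost grows with the offset. For deep pages it may be cheaper to remember the last document number returned and resume from there, provided the same reader stays open, since document numbers are only valid for one open reader.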
An alternative I found is something like this:
IndexSearcher is = getIndexSearcher(); // e.g. new IndexSearcher(indexReader)
// Get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);
TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);
// Return only the requested page of hits.
TopDocs topDocs = topCollector.topDocs(offset, count);
Note: replace the text between [[ ]] with your own values. I ran this on a large index with 1.5 million entries and got 10 random results in less than a second. Admittedly it is slower, but at least you can ignore deleted documents if you need pagination.
