
I had a program running for 2 days to build a Lucene index for around 160 million text files. After the program ended, I tried searching the index and found that it was not built correctly: indexReader.numDocs() returned 0. I checked the index directory and it looked fine; all the index data seemed to be there, and the directory is 1.5 gigabytes in size.

I checked my code and found that I forgot to call indexWriter.optimize() and indexWriter.close(). Is it possible to re-optimize() the index so that I don't need to rebuild the whole index from scratch? I don't really want the program to take another 2 days.

Narayan
neevek
    How do you know the index is corrupt? Try opening it in Luke (http://www.getopt.org/luke/) and see if it shows the documents. – Narayan Mar 21 '11 at 06:27

1 Answer


Calling IndexWriter.optimize() is not required; you can call it later by reopening the index with a new IndexWriter. It only merges the index segments for better read performance and doesn't otherwise affect the index contents.

If you forgot to call IndexWriter.close(), however, your index might not be complete. Since you processed so many documents, it likely flushed most of them, so hopefully you only need to re-index the last ones. Use Luke, as suggested, to quickly browse the index and see what state it's in.

WhiteFang34
  • Thanks for your reply. I think I need to re-index all the files, because I have no idea which documents were not flushed, and I need the index to be exact. – neevek Mar 21 '11 at 09:14
  • You could iterate through the documents in the index to determine which ones exist before you re-index everything. See http://stackoverflow.com/questions/2311845/is-it-possible-to-iterate-through-documents-stored-in-lucene-index – WhiteFang34 Mar 21 '11 at 09:35
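A minimal sketch of the bookkeeping behind that suggestion, in plain Java with no Lucene calls. It assumes each indexed document stored its source file path in some field and that those paths have already been collected into a set by iterating the index. The class `ReindexPlanner` and the method `missingPaths` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: decides which source files
// still need indexing, given the paths already found in the index.
class ReindexPlanner {

    // Returns the paths from allPaths that are absent from indexedPaths,
    // preserving the order of allPaths.
    static List<String> missingPaths(Collection<String> allPaths, Set<String> indexedPaths) {
        List<String> missing = new ArrayList<>();
        for (String path : allPaths) {
            if (!indexedPaths.contains(path)) {
                missing.add(path);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed inputs: every file that should be indexed, and the
        // paths read back from the (possibly incomplete) index.
        List<String> allPaths = Arrays.asList("a.txt", "b.txt", "c.txt");
        Set<String> indexedPaths = new HashSet<>(Arrays.asList("a.txt", "c.txt"));
        System.out.println(missingPaths(allPaths, indexedPaths));
    }
}
```

Only the files returned by `missingPaths` would then need to be re-indexed. With around 160 million paths, an in-memory set may need a large heap, so comparing two sorted path lists on disk could be a more practical variant of the same idea.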