How can I index very large text files with Lucene? The minimal example below throws an OutOfMemoryError when given a 2 GB text file. I had hoped that passing a FileReader to the Field constructor would let Lucene stream the file contents, but that does not seem to be the case.
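import java.io.FileReader;
import java.io.Reader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.store.SimpleFSLockFactory;

public class LargeFileIndexer {
    public static void main(String[] args) {
        String indexDir = "c:/temp/ix";
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setCommitOnClose(true);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        // try-with-resources closes the directory, writer, and file reader even on failure.
        try (SimpleFSDirectory simpleFsDir = new SimpleFSDirectory(Paths.get(indexDir), SimpleFSLockFactory.INSTANCE);
             IndexWriter writer = new IndexWriter(simpleFsDir, config);
             Reader contentReader = new FileReader("c:/temp/twoGbTextFile.txt")) {
            // Field type for the Id: stored so it can be retrieved, but not searchable.
            FieldType storedNotIndexed = new FieldType();
            storedNotIndexed.setStored(true);
            storedNotIndexed.setIndexOptions(IndexOptions.NONE);
            // Field type for the content: searchable, but the text itself is not stored.
            FieldType indexedNotStored = new FieldType();
            indexedNotStored.setStored(false);
            indexedNotStored.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
            Field idField = new Field("Id", "1", storedNotIndexed);
            // The whole 2 GB file becomes the value of a single field in a single document.
            Field contentField = new Field("Content", contentReader, indexedNotStored);
            Document document = new Document();
            document.add(idField);
            document.add(contentField);
            writer.addDocument(document); // the OutOfMemoryError is thrown from this call
            writer.commit();
        } catch (Exception ex) {
            System.out.println(ex.toString());
        }
    }
}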
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:209)
    at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
    at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:46)
    at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:250)
    at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:271)
    at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149)
    at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
    at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1569)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1314)
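
The trace shows the heap filling up inside the in-memory postings structures (FreqProxPostingsArray, BytesRefHash) while a single document is being inverted. As far as I understand, the Reader is consumed incrementally, but the postings buffered for a document can only be flushed to disk once that document is complete, so one 2 GB document needs its whole term table in the heap at once. One possible workaround, sketched here under assumptions, is to split the file into bounded chunks and index each chunk as its own document. The helper below uses only the JDK. `LineChunker`, `forEachChunk`, and the `maxChars` limit are hypothetical names introduced for illustration, and the Lucene call is left as a comment in the sink rather than presented as the library's recommended approach.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: streams text from a Reader and hands it on in bounded chunks. */
final class LineChunker {

    /**
     * Reads lines from source and passes them to sink in chunks of at most
     * maxChars characters, so only one chunk is held in memory at a time.
     * A single line longer than maxChars is emitted as its own oversized chunk.
     * Returns the number of chunks emitted.
     */
    static int forEachChunk(Reader source, int maxChars, Consumer<String> sink) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        int chunks = 0;
        StringBuilder buffer = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Flush before this line would push the chunk past the limit.
                if (buffer.length() > 0 && buffer.length() + line.length() + 1 > maxChars) {
                    sink.accept(buffer.toString());
                    buffer.setLength(0);
                    chunks++;
                }
                buffer.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit whatever is left after the last line.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<>();
        // Assumption: in the Lucene case the sink would build one Document per chunk
        // (file Id, chunk number, chunk text) and pass it to writer.addDocument.
        int count = forEachChunk(new StringReader("alpha\nbeta\ngamma\ndelta\n"), 12, chunks::add);
        System.out.println(count + " chunks: " + chunks);
    }
}
```

For the 2 GB file, the source would be the FileReader and `maxChars` something on the order of a few megabytes, so the writer gets a chance to flush between documents. The trade-off is that hits then refer to chunks rather than to the whole file, and a phrase that straddles a chunk boundary will not match, so each chunk document would need the file Id and a chunk number stored alongside it.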