
How can I index very large text files with Lucene? I have created a minimal example below which hits an OutOfMemoryError when presented with a 2GB text file. I was hoping that providing the FileReader to the Field constructor would allow the input file contents to be streamed, but it seems that is not the case.

    import java.io.FileReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.IndexOptions;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.store.SimpleFSLockFactory;

    public class LargeFileIndexer {
        public static void main(String[] args) {
            try {
                String indexDir = "c:/temp/ix";
                SimpleFSDirectory simpleFsDir = new SimpleFSDirectory(Paths.get(indexDir), SimpleFSLockFactory.INSTANCE);
                StandardAnalyzer analyzer = new StandardAnalyzer();
                IndexWriterConfig config = new IndexWriterConfig(analyzer);
                config.setCommitOnClose(true);
                config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);

                IndexWriter writer = new IndexWriter(simpleFsDir, config);

                // Stored but not indexed: used for the document id.
                FieldType storedNotIndexed = new FieldType();
                storedNotIndexed.setStored(true);
                storedNotIndexed.setIndexOptions(IndexOptions.NONE);

                // Indexed but not stored: the file contents are tokenized but not kept verbatim.
                FieldType indexedNotStored = new FieldType();
                indexedNotStored.setStored(false);
                indexedNotStored.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);

                Field idField = new Field("Id", "1", storedNotIndexed);
                // Reader-valued field, in the hope that the file is streamed rather than buffered.
                Field contentField = new Field("Content", new FileReader("c:/temp/twoGbTextFile.txt"), indexedNotStored);

                Document document = new Document();
                document.add(idField);
                document.add(contentField);

                writer.addDocument(document); // OutOfMemoryError is thrown here

                writer.commit();
                writer.close();
            }
            catch (Exception ex) {
                System.out.println(ex.toString());
            }
        }
    }

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.<init>(FreqProxTermsWriterPerField.java:209)
        at org.apache.lucene.index.FreqProxTermsWriterPerField$FreqProxPostingsArray.newInstance(FreqProxTermsWriterPerField.java:230)
        at org.apache.lucene.index.ParallelPostingsArray.grow(ParallelPostingsArray.java:46)
        at org.apache.lucene.index.TermsHashPerField$PostingsBytesStartArray.grow(TermsHashPerField.java:250)
        at org.apache.lucene.util.BytesRefHash.add(BytesRefHash.java:271)
        at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:149)
        at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:796)
        at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:447)
        at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:403)
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:232)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:478)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1569)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1314)

Owen Pauling
  • See https://stackoverflow.com/questions/37335/how-to-deal-with-java-lang-outofmemoryerror-java-heap-space-error-64mb-heap – Gaurav Feb 15 '18 at 11:13
  • @Gaurav I don't think increasing the heap size is really a solution. It just kicks the problem down the road until a bigger file comes along (assuming a larger heap even allows this file to be indexed). Preferably I wouldn't need that additional heap space to begin with and could somehow stream the data through. – Owen Pauling Feb 15 '18 at 11:44
  • How about splitting the whole process into pieces, e.g. close, commit and reopen the writer after a given number of records / amount of data processed? – dom Feb 16 '18 at 16:45
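
Below is a minimal sketch of the chunking idea from the comments, assuming Lucene 6+: rather than handing the whole 2GB file to one Reader-valued field, the file is read line by line and indexed as many smaller documents, with a periodic commit so the per-document indexing buffers stay bounded. The class name, the 10,000-line chunk size, the commit interval and the extra "Chunk" field are illustrative assumptions, not part of the original question.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.SimpleFSDirectory;

    public class ChunkedIndexer {
        public static void main(String[] args) throws Exception {
            SimpleFSDirectory dir = new SimpleFSDirectory(Paths.get("c:/temp/ix"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

            try (IndexWriter writer = new IndexWriter(dir, config);
                 BufferedReader reader = new BufferedReader(new FileReader("c:/temp/twoGbTextFile.txt"))) {

                final int linesPerChunk = 10_000; // assumed chunk size
                StringBuilder chunk = new StringBuilder();
                int linesInChunk = 0;
                int chunkNo = 0;
                String line;

                while ((line = reader.readLine()) != null) {
                    chunk.append(line).append('\n');
                    if (++linesInChunk == linesPerChunk) {
                        addChunk(writer, chunkNo++, chunk.toString());
                        chunk.setLength(0);
                        linesInChunk = 0;
                        if (chunkNo % 50 == 0) {
                            writer.commit(); // flush to disk periodically, as suggested
                        }
                    }
                }
                if (chunk.length() > 0) {
                    addChunk(writer, chunkNo, chunk.toString()); // last partial chunk
                }
                writer.commit();
            }
        }

        private static void addChunk(IndexWriter writer, int chunkNo, String content) throws Exception {
            Document doc = new Document();
            // Keep the original file id plus a chunk number so hits can be mapped back to the file.
            doc.add(new StringField("Id", "1", Field.Store.YES));
            doc.add(new StringField("Chunk", Integer.toString(chunkNo), Field.Store.YES));
            doc.add(new TextField("Content", content, Field.Store.NO));
            writer.addDocument(doc);
        }
    }

The trade-off is that the file is no longer a single Lucene document, so queries match individual chunks and any whole-file view has to be reassembled from the stored "Id"/"Chunk" fields.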

0 Answers