
Problem: For my test, I want to extract text data with Apache Tika from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt".

My solution: I tried to use TikaInputStream since it provides buffering, then I tried BufferedInputStream, but that didn't solve my problem. Here is my test class:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class Printer {
    public void readMyFile(String fname) throws IOException, SAXException,
            TikaException {
        System.out.println("Working...");

        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));

        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();

        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);

        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}

Problem: Upon invoking the parser's parse method, I get:

Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuffer.append(StringBuffer.java:322)
    at java.io.StringWriter.write(StringWriter.java:94)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
    at com.tastyminerals.cli.Printer.main(Printer.java:46) 

I tried increasing the JVM heap up to -Xms512M -Xmx1024M; that didn't work, and I don't want to use any bigger values.

Questions: What is wrong with my code? How should I modify my class so that it can extract text from a file larger than 300 MB with Apache Tika?

– minerals
  • Would you add more of the OutOfMemoryError stack trace? Then we could see where it lifts off. – cheffe Jul 03 '14 at 09:02
  • Using BodyContentHandler probably isn't very smart, as that buffers the whole content into memory before returning. Can you try swapping that for a ContentHandler which processes the output text data as you go without buffering? – Gagravarr Jul 03 '14 at 09:32
  • If I use `ContentHandler content = new LinkContentHandler();` the `OutOfMemoryError` does not appear. Guess you are right, I just don't have enough memory. – minerals Jul 03 '14 at 14:27
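
For reference, a non-buffering handler of the kind Gagravarr describes can be as small as the following sketch (a hypothetical CountingHandler; it merely counts characters, but any per-chunk processing could go into characters()):

import org.xml.sax.helpers.DefaultHandler;

public class CountingHandler extends DefaultHandler {
    private long chars = 0;

    // Tika calls this repeatedly with chunks of extracted text;
    // process each chunk here instead of accumulating it in memory.
    @Override
    public void characters(char[] ch, int start, int length) {
        chars += length;
    }

    public long getCharCount() {
        return chars;
    }
}

Passing an instance of this to parser.parse(...) keeps memory usage flat no matter how large the file is.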

4 Answers


You can construct the handler like this to avoid the size limit:

BodyContentHandler bodyHandler = new BodyContentHandler(-1);
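
Note that -1 only disables the handler's default 100,000-character write limit; BodyContentHandler still buffers the entire extracted text in memory, so the heap must be large enough to hold it.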
– Sachin

You can use incremental parsing. The Tika facade's parse method returns a Reader from which the extracted text can be read while it is being produced:

Tika tika = new Tika();
String contentStr;
// 'response' is assumed to be an HTTP response whose entity body is an InputStream.
// tika.parse() returns a Reader that is parsed lazily as it is consumed;
// try-with-resources closes it even if reading fails.
try (Reader fulltext = tika.parse(response.getEntityInputStream())) {
    // Note: IOUtils.toString() still collects the entire text into one String,
    // so the heap must be able to hold the extracted content.
    contentStr = IOUtils.toString(fulltext);
}
– duvo

Pass BodyContentHandler a Writer or OutputStream instead of an int

As Gagravarr mentioned, the BodyContentHandler you've used builds an internal string buffer of the file's content. Because the entire content is stored in memory at once, this approach hits an OutOfMemoryError for large files.

If your goal is to write out the Tika parse results to another file for later processing, you can construct BodyContentHandler with a Writer (or OutputStream directly) instead of passing an int:

Path outputFile = Path.of("output.txt"); // Paths.get() if not using Java 11
PrintWriter printWriter = new PrintWriter(Files.newOutputStream(outputFile));
BodyContentHandler content = new BodyContentHandler(printWriter);

And then call Tika parse:

Path inputFile = Path.of("input.txt");
TikaInputStream inputStream = TikaInputStream.get(inputFile);

AutoDetectParser parser = new AutoDetectParser();
Metadata meta = new Metadata();
ParseContext context = new ParseContext();

parser.parse(inputStream, content, meta, context);

By doing this, Tika will automatically write the content to the outputFile as it parses, instead of trying to keep it all in memory. Using a PrintWriter will buffer the output, reducing the number of writes to disk.

Note that Tika will not automatically close your input or output streams for you.
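
For completeness, here is a minimal end-to-end sketch (assuming Java 11+ for Path.of, and the hypothetical file names from above) in which both streams are closed by try-with-resources:

try (TikaInputStream inputStream = TikaInputStream.get(Path.of("input.txt"));
     PrintWriter printWriter = new PrintWriter(Files.newOutputStream(Path.of("output.txt")))) {
    BodyContentHandler content = new BodyContentHandler(printWriter);
    AutoDetectParser parser = new AutoDetectParser();
    // The extracted text is written to output.txt as parsing proceeds.
    parser.parse(inputStream, content, new Metadata(), new ParseContext());
}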

– Swizzard

Solution with ByteArrayInputStream

I had a similar problem with CSV files. If they were read in Java with the wrong charset, only some of the records could be imported. The following method from my library detects a file's encoding and prevents read errors.

public static String lib_getCharset( String fullFile ) {

    // Initialize variables.    
    String             returnValue = ""; 
    BodyContentHandler handler     = new BodyContentHandler( -1 );
    Metadata           meta        = new Metadata();

    // Convert the BufferedInputStream to a ByteArrayInputStream.
    try( final InputStream is = new BufferedInputStream( new FileInputStream( fullFile ) ) ) {

        InputStream  bais    = new ByteArrayInputStream( is.readAllBytes() );
        ParseContext context = new ParseContext();
        TXTParser    parser  = new TXTParser();
    
        // Run the Tika TXTParser and read the metadata.          
        try {
    
            parser.parse( bais, handler, meta, context );
    
            // Iterate over the metadata names and pick out the charset, if one was detected.
            for( String metaName : meta.names() ) {

                if( metaName.equals( "Content-Encoding" ) ) {

                    returnValue = meta.get( metaName );
                }
            }

        } catch( SAXException | TikaException se_te ) {
        
            se_te.printStackTrace();
        }

    } catch( IOException e ) {

        e.printStackTrace();
    }    

    return returnValue;
}

Using a Scanner, the file can then be imported as follows:

Scanner scanner     = null;
String  charsetChar = TrnsLib.lib_getCharset( fullFileName );
    
try {

    // Scan the file, e.g. with UTF-8 or
    //                          ISO8859-1 or windows-1252 for ANSI.
    scanner = new Scanner( new File( fullFileName ), charsetChar );
        
} catch( FileNotFoundException e ) {
    
    e.printStackTrace();
}

Don't forget to add the two Tika dependencies to your pom.xml:
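
The exact coordinates depend on your Tika version; assuming Tika 2.x (which uses the module names shown below), the section should look roughly like this, with the version placeholders filled in:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version><!-- your Tika version --></version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parser-text-module</artifactId>
    <version><!-- your Tika version --></version>
</dependency>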

and the definition of the requires in module-info.java:

module org.wnt.wnt94lib {
    requires transitive org.apache.tika.core;
    requires transitive org.apache.tika.parser.txt;
}

My solution works fine with small files (up to about 100 lines of 300 characters); larger files need more attention. The Babylonian confusion around CR and LF leads to inconsistencies in Apache Tika: with the parameter set to -1, BodyContentHandler reads the whole text file, but only the first roughly 100 lines are used to detect the charset. In CSV files in particular, exotic characters like ä, ö or ü are rare, so with some bad luck Tika sees only the combined CR and LF characters and concludes that the file must be ANSI rather than UTF-8.

So, what can you do? Quick and dirty, you can add the letters ÄÖÜ to the file's first line. A better solution, however, is to normalize the line endings: load the file with Notepad++ and show all characters under View > Show Symbol. Then, under Search > Replace..., delete all CRs: select Extended under Search Mode, enter \r\n under Find what and \n under Replace with, set the cursor on the file's first line and press Replace All. This frees the file from the burden of remembering the good old typewriter and converts it into a proper Unix file with UTF-8.
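
If you would rather do this programmatically, here is a minimal sketch of the same idea (file names are placeholders; like the method above, it reads the whole file into memory, so it is only suitable for small files):

Path in  = Path.of( "input.csv" );
Path out = Path.of( "input-lf.csv" );

// Read the raw bytes, drop every CR (0x0D) and write the rest back out.
byte[] bytes = Files.readAllBytes( in );
ByteArrayOutputStream buf = new ByteArrayOutputStream( bytes.length );
for( byte b : bytes ) {
    if( b != '\r' ) {
        buf.write( b );
    }
}
Files.write( out, buf.toByteArray() );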

Afterwards, however, do not edit the CSV file with Excel. The program, which I otherwise really appreciate, converts your file back into one with CR ballast. To save correctly, without CR, you have to use VBA; Ekkehard Horner describes how at: VBA : save a file with UTF-8 without BOM