Problem: For my test, I want to use Apache Tika to extract text from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt".
My solution: I first tried TikaInputStream, since it provides buffering, and then BufferedInputStream, but neither solved my problem. Here is my test class:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
public class Printer {
    public void readMyFile(String fname) throws IOException, SAXException,
            TikaException {
        System.out.println("Working...");
        File f = new File(fname);
        // InputStream stream = TikaInputStream.get(new File(fname));
        InputStream stream = new BufferedInputStream(new FileInputStream(fname));
        Metadata meta = new Metadata();
        ContentHandler content = new BodyContentHandler(Integer.MAX_VALUE);
        AutoDetectParser parser = new AutoDetectParser();
        String mime = new Tika().detect(f);
        meta.set(Metadata.CONTENT_TYPE, mime);
        System.out.println("trying to parse...");
        try {
            parser.parse(stream, content, meta, new ParseContext());
        } finally {
            stream.close();
        }
    }

    public static void main(String[] args) {
        Printer p = new Printer();
        try {
            p.readMyFile("test/pagecounts-20140701-060000.txt");
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}
Problem: Upon invoking the parse method of the parser, I get:
Working...
trying to parse...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuffer.append(StringBuffer.java:322)
    at java.io.StringWriter.write(StringWriter.java:94)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:92)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:135)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SecureContentHandler.characters(SecureContentHandler.java:270)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:88)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.tastyminerals.cli.Printer.readMyFile(Printer.java:37)
    at com.tastyminerals.cli.Printer.main(Printer.java:46)
I tried increasing the JVM heap to -Xms512M -Xmx1024M, but that didn't help, and I don't want to use larger values.
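As a rough sanity check on those heap settings, here is a back-of-envelope estimate in Java. It is only a sketch under assumptions that are not confirmed above: that the whole extracted text accumulates in a single StringBuffer (which the StringWriter frames in the stack trace suggest), that the file is mostly one byte per character, that each char takes 2 bytes in memory, and that the buffer grows to roughly twice its capacity plus 2 while the old array is still live. The names HeapEstimate, bufferBytes and peakBytesDuringGrowth are made up for illustration.

```java
public class HeapEstimate {
    // Assumption: each char in the buffer occupies 2 bytes (char[]-backed buffer).
    static final long BYTES_PER_CHAR = 2;

    /** Bytes needed to hold the given number of characters in one char array. */
    static long bufferBytes(long chars) {
        return chars * BYTES_PER_CHAR;
    }

    /**
     * Bytes live while a full buffer of the given size is being grown:
     * the old array plus the new, roughly doubled array exist at the same time.
     */
    static long peakBytesDuringGrowth(long chars) {
        long oldArray = bufferBytes(chars);
        // Assumed growth rule: new capacity = old capacity * 2 + 2.
        long newArray = bufferBytes(chars * 2 + 2);
        return oldArray + newArray;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long fileBytes = 335L * mb; // size of the file in the question
        long chars = fileBytes;     // assumption: one byte per character

        System.out.println("Text held as chars: "
                + bufferBytes(chars) / mb + " MB");
        System.out.println("Peak while growing a half-size buffer: "
                + peakBytesDuringGrowth(chars / 2) / mb + " MB");
    }
}
```

If these assumptions roughly hold, the text alone would need about 670 MB, and a single growth step from half that size would briefly need about 1005 MB, which is already close to the 1024 MB limit.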
Questions: What is wrong with my code? How should I modify my class so that it can extract text from a test file larger than 300 MB with Apache Tika?