I am trying to find the length of the text content inside a docx file. I can extract the content with the code below, but when the file is too large I get an `OutOfMemoryError`. Is there a better way to do this?

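    import org.apache.poi.openxml4j.opc.OPCPackage;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
    XWPFDocument doc = new XWPFDocument(opcPackage);
    XWPFWordExtractor we = new XWPFWordExtractor(doc);
    String text = we.getText();
    // length() counts UTF-16 chars, so this prints the text length in units of 1024 chars
    System.out.println("Text length (K chars): " + text.length() / 1024);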

The error occurs on this line:

    XWPFDocument doc = new XWPFDocument(opcPackage);

This is the exception:

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.xmlbeans.impl.store.CharUtil.allocate(CharUtil.java:397)
    at org.apache.xmlbeans.impl.store.CharUtil.saveChars(CharUtil.java:441)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.text(Cur.java:2922)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3043)
    at org.apache.xmlbeans.impl.store.Cur$CurLoadContext.attr(Cur.java:3060)
    at org.apache.xmlbeans.impl.store.Locale$SaxHandler.startElement(Locale.java:3254)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.reportStartTag(Piccolo.java:1082)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseAttributesNS(PiccoloLexer.java:1802)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseOpenTagNS(PiccoloLexer.java:1521)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseTagNS(PiccoloLexer.java:1362)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXMLNS(PiccoloLexer.java:1293)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.parseXML(PiccoloLexer.java:1261)
    at org.apache.xmlbeans.impl.piccolo.xml.PiccoloLexer.yylex(PiccoloLexer.java:4808)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yylex(Piccolo.java:1290)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.yyparse(Piccolo.java:1400)
    at org.apache.xmlbeans.impl.piccolo.xml.Piccolo.parse(Piccolo.java:714)
    at org.apache.xmlbeans.impl.store.Locale$SaxLoader.load(Locale.java:3439)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1270)
    at org.apache.xmlbeans.impl.store.Locale.parseToXmlObject(Locale.java:1257)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.parse(SchemaTypeLoaderBase.java:345)
    at org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument$Factory.parse(Unknown Source)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:135)
    at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
    at org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:107)
    at ReadDocFileFromJava.readMyDocument(ReadDocFileFromJava.java:24)
    at ReadDocFileFromJava.main(ReadDocFileFromJava.java:15)
Cool
  • You can try to [increase the heap size for your Java program](http://stackoverflow.com/questions/1565388/increase-heap-size-in-java). For example: `-Xmx2g` – eebbesen Nov 28 '13 at 15:21
  • Yeah, I know that option, but I wanted to know whether there is a different/better way to read the docx file content. Actually I don't want to read it; I just want to know the size of the content. – Cool Nov 28 '13 at 15:26
  • If you just want the size, why not just use a `File` object to check the size of it on disk and be done with it? – Gagravarr Nov 28 '13 at 15:58
  • A docx file compresses its text content when it is saved, so the file size on disk differs from the actual size of the content. – Cool Nov 28 '13 at 16:27
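
Since a docx file is a ZIP archive, one possible way to get the text length without building the whole `XWPFDocument` model in memory is to stream the main document part and add up the characters found in `w:t` elements. The following is only a minimal sketch under some assumptions: the class name `DocxTextLength` and the method `countTextChars` are made up for illustration, the main part is assumed to be stored at `word/document.xml` (the usual location, though the package relationships can point elsewhere), and the file is assumed to use the transitional WordprocessingML namespace. It uses only the JDK (`java.util.zip` and StAX), not POI, so the result will probably not match `XWPFWordExtractor.getText().length()` exactly, because the extractor also adds line breaks and text from other parts such as headers and footnotes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Apache POI.
final class DocxTextLength {

    // Transitional WordprocessingML namespace; text runs are stored in <w:t>.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: the main document part lives at this path inside the ZIP.
    private static final String MAIN_PART = "word/document.xml";

    /** Streams the main part and sums the characters inside w:t elements. */
    static long countTextChars(String docxPath) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry(MAIN_PART);
            if (entry == null) {
                throw new IOException(MAIN_PART + " not found in " + docxPath);
            }
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not process DTDs or external entities from untrusted files.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

            try (InputStream in = zip.getInputStream(entry)) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                try {
                    long total = 0;
                    boolean inText = false;
                    while (reader.hasNext()) {
                        int event = reader.next();
                        if (event == XMLStreamConstants.START_ELEMENT) {
                            inText = W_NS.equals(reader.getNamespaceURI())
                                    && "t".equals(reader.getLocalName());
                        } else if (event == XMLStreamConstants.END_ELEMENT) {
                            inText = false;
                        } else if (inText && (event == XMLStreamConstants.CHARACTERS
                                || event == XMLStreamConstants.CDATA)) {
                            // Only the length is kept; the text itself is never accumulated.
                            total += reader.getTextLength();
                        }
                    }
                    return total;
                } finally {
                    reader.close();
                }
            }
        }
    }
}
```

With the code from the question this could be called as `long chars = DocxTextLength.countTextChars(file.getAbsolutePath());`. The parser only holds a small buffer at a time, so memory use should stay roughly constant regardless of document size. If the uncompressed size of the XML including markup is enough, `ZipEntry.getSize()` on the same entry gives that number without any parsing.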
