
I have encountered a problem while extracting text from a PDF.

01-29 09:44:15.397: E/dalvikvm-heap(8037): Out of memory on a 5440032-byte allocation.

I looked at the contents of the page, and it has an image above the text. What I want to know is: how do I catch the error and skip that page? I have tried:

try {
    pages = new String[pdfPage];
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    TextExtractionStrategy strategy;
    for (int pageNum = 1; pageNum <= pdfPage; pageNum++) {
        // String original_content = "";
        // original_content = PdfTextExtractor.getTextFromPage(reader,
        // pageNum, new SimpleTextExtractionStrategy());
        Log.e("MyActivity", "PageCatch: " + (pageNum + fromPage));
        strategy = parser.processContent(pageNum,
                new SimpleTextExtractionStrategy());
        readPDF(strategy.getResultantText(), pageNum - 1);
    }
} catch (Exception e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

The try/catch above does not catch the error raised by strategy = parser.processContent(pageNum, new SimpleTextExtractionStrategy()); I already tried commenting out all the lines inside the for loop, and then there is no error. But when I leave only strategy = parser.processContent(pageNum, new SimpleTextExtractionStrategy()); uncommented, the error occurs.

Christian Eric Paran
How big is the PDF in question? Which PdfReader constructor do you use? Try the one using the random access file or array constructor. – mkl Jan 29 '13 at 06:15

2 Answers


As I understand the error, it occurs when there is not enough memory to hold the data you are reading. I believe you can't catch that error.

I would strongly suggest dropping old data you no longer need, and making sure the data you keep in your variables is not too large.

Or refer to this:

Out of memory error due to large number of image thumbnails to display

She Smile GM

You want to catch the error and skip that page and tried using

try {
    ...
} catch (Exception e) {
    ...
}

which didn't do the trick. Unless the DalvikVM handles out-of-memory situations completely differently from Java VMs, this is no surprise: the Throwable used by Java in such situations is an OutOfMemoryError, i.e. not an Exception but an Error, the other big subtype of Throwable. Thus, you might want to try

} catch (OutOfMemoryError e) {

or

} catch (Error e) {

or even

} catch (Throwable e) {

to handle your issue. Beware, though: when an Error is thrown, this generally means something bad is happening. Catching and ignoring it might therefore leave the program in a weird state.
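To illustrate the difference between the catch clauses, here is a minimal, library-free sketch. The class CatchDemo and its method handledBy are made up for this example, and the OutOfMemoryError is thrown by hand to simulate the situation; nothing here calls iText.

```java
// Hypothetical demo class: shows which catch clause handles which Throwable.
final class CatchDemo {

    /** Runs the task and reports which catch clause (if any) handled what it threw. */
    static String handledBy(Runnable task) {
        try {
            task.run();
            return "nothing thrown";
        } catch (Exception e) {
            // Reached for Exception subtypes only; an Error never lands here.
            return "catch (Exception)";
        } catch (OutOfMemoryError e) {
            // OutOfMemoryError extends Error, not Exception, so it needs its own clause.
            return "catch (OutOfMemoryError)";
        }
    }

    public static void main(String[] args) {
        System.out.println(handledBy(() -> { throw new IllegalStateException("demo"); }));
        // Simulated error; a real one would come from a failed allocation.
        System.out.println(handledBy(() -> { throw new OutOfMemoryError("simulated"); }));
    }
}
```

If the second catch clause were removed, the simulated OutOfMemoryError would propagate out of handledBy, which matches the behaviour described in the question.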

Obviously, though, if you (as you said) only want to try and skip a single page and otherwise continue, you'll have to position the try { ... } catch() { ... } differently, more specifically around the handling of the single page, i.e. inside the loop.
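A rough sketch of that placement follows. To keep it self-contained, the iText call is hidden behind a hypothetical interface, PageTextSource; in the question's code its body would be the parser.processContent(pageNum, new SimpleTextExtractionStrategy()).getResultantText() call. The class name, the skipped list, and the choice to store an empty string for a skipped page are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the try/catch sits inside the loop, so one bad page
// is skipped and the remaining pages are still processed.
final class PageSkipSketch {

    /** Stand-in for the real per-page extraction call (assumed, not an iText type). */
    interface PageTextSource {
        String textOfPage(int pageNum);
    }

    static String[] extractAll(PageTextSource source, int pageCount, List<Integer> skipped) {
        String[] pages = new String[pageCount];
        for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
            try {
                pages[pageNum - 1] = source.textOfPage(pageNum);
            } catch (OutOfMemoryError e) {
                // Only this page is abandoned; record it and continue with the next one.
                pages[pageNum - 1] = "";
                skipped.add(pageNum);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Integer> skipped = new ArrayList<>();
        String[] pages = extractAll(p -> {
            if (p == 2) throw new OutOfMemoryError("simulated");
            return "text of page " + p;
        }, 3, skipped);
        System.out.println(pages.length + " pages, skipped: " + skipped);
    }
}
```

Whether the VM is still in a usable state after a real OutOfMemoryError is not guaranteed, so treat the skipped list as a hint that something went wrong rather than as a clean recovery.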

On the other hand, dropping all references to objects held by the PDF library and re-opening the PDF might help; remember Kevin's answer to your question "Search Text and Capacity of iText to read" on the iText-Questions mailing list. Following that advice, you'd have all iText use and a limited loop (for a confined number of pages) inside the try { ... } catch() { ... }, and you'd merely remember the last page read in some outer variables.
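One possible shape for such a batched loop is sketched below. The interfaces OpenDocument and DocumentOpener are hypothetical stand-ins for opening a PdfReader and extracting a page; the batch size and the decision to skip the failing page are assumptions of this example, not something taken from the mailing-list advice.

```java
// Hypothetical sketch: process a limited number of pages per "open", keep only
// the next page number outside the try, and re-open after every batch or failure.
final class BatchedExtractionSketch {

    /** Stand-in for an opened PDF (assumed; in real code this would wrap the reader). */
    interface OpenDocument {
        String textOfPage(int pageNum);
        void close();
    }

    /** Stand-in for whatever re-opens the PDF. */
    interface DocumentOpener {
        OpenDocument open();
    }

    /** Fills out[] with page texts and returns how often the document was opened. */
    static int extractInBatches(DocumentOpener opener, int pageCount, int batchSize, String[] out) {
        if (batchSize < 1) throw new IllegalArgumentException("batchSize must be >= 1");
        int nextPage = 1;   // the only state remembered across batches
        int openCount = 0;
        while (nextPage <= pageCount) {
            OpenDocument doc = null;
            try {
                doc = opener.open();
                openCount++;
                int lastInBatch = Math.min(pageCount, nextPage + batchSize - 1);
                while (nextPage <= lastInBatch) {
                    out[nextPage - 1] = doc.textOfPage(nextPage);
                    nextPage++;
                }
            } catch (OutOfMemoryError e) {
                // Skip the page being handled when memory ran out, then start a fresh batch.
                out[nextPage - 1] = "";
                nextPage++;
            } finally {
                // Release the document so its objects can be garbage collected.
                if (doc != null) doc.close();
            }
        }
        return openCount;
    }
}
```

Because doc is declared inside the loop body, no reference to the previous document survives into the next batch, which is the point of re-opening.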

Furthermore, you can limit memory usage by using a PdfReader constructor taking a RandomAccessFileOrArray parameter. Readers constructed that way don't hold the whole PDF in memory, only the cross reference table and some central objects. Everything else is read on demand.

mkl