I am trying to read a PDF file through iText. The program reads the PDF successfully, but the extracted text does not preserve the spacing between columns.

Program:

    public void parse(String filename) throws IOException {
        PdfReader reader = new PdfReader(filename);
        PdfReaderContentParser pdfReaderContentParser = new PdfReaderContentParser(reader);
        TextExtractionStrategy strategy = null;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            String text = PdfTextExtractor.getTextFromPage(reader, i, new LocationTextExtractionStrategy());
            System.out.println(text);
        }
    }

[Image: the data I need to extract from the PDF]

When the program reads the PDF, the output is:

  DATE MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
   01-04-2017 B/F 54,396.82

As the image shows, DATE is 01-04-2017, MODE is empty, PARTICULARS is B/F, DEPOSITS and WITHDRAWALS are also empty, and BALANCE is 54,396.82. I need the same layout in text form, e.g.:

 DATE      MODE PARTICULARS DEPOSITS WITHDRAWALS BALANCE
 01-04-2017     B/F                              54,396.82

Need help, thanks in advance.

Baba
  • You might want to read [this question and answer](https://stackoverflow.com/a/24911617/1729265). – mkl May 30 '17 at 12:47
  • One of your problems (not related to the question) is that you are using iText 4.x. Whoops. You are using a version that was not released by iText Software. – Amedee Van Gasse May 30 '17 at 13:00

1 Answer

You are extracting text from the PDF and the result is correct: it is not missing spaces, because there are no spaces in the raw text.

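However (I missed that earlier, so I'm editing), you are using a LocationTextExtractionStrategy, which is "table-aware" in the sense that it keeps track of where each piece of text sits on the page. This is good, but at the end getTextFromPage only returns the joined text, so that position information is lost.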

So instead you could write your own strategy that extends LocationTextExtractionStrategy and adds a getTabulatedText() method that emits the text with spaces inserted where you want them. Take inspiration from getResultantText(): see how it inserts a single space between cells. In your code you would insert as many spaces (or tabs) as needed. See this answer for an example.

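for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Use a fresh strategy per page so text from earlier pages does not accumulate.
    MyTextExtractionStrategy strategy = new MyTextExtractionStrategy();
    // Run the extraction to feed the strategy; the single-spaced return value is not needed.
    PdfTextExtractor.getTextFromPage(reader, i, strategy);
    String tabulatedText = strategy.getTabulatedText();
    System.out.println(tabulatedText);
}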

(Maybe there is a strategy implementation that already does this, but I don't know of one.)
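To make the "insert as many spaces as needed" idea concrete, here is a minimal sketch of the padding logic on its own, deliberately without any iText dependency. The `Chunk` class, the `layout` method, and the fixed `charWidth` are all hypothetical names and assumptions for illustration; in a real getTabulatedText() you would feed it the text and x-coordinates your strategy has collected, and the x values in `main` are made up, not read from the PDF in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TabulatedLine {

    /** Hypothetical stand-in for a positioned piece of text; not an iText class. */
    static final class Chunk {
        final String text;
        final float x; // left edge of the chunk, in page units

        Chunk(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /**
     * Places the chunks of one line on a character grid, assuming every
     * character is about charWidth page units wide (a simplification).
     */
    static String layout(List<Chunk> chunks, float charWidth) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble(c -> c.x));

        StringBuilder line = new StringBuilder();
        for (Chunk c : sorted) {
            // Convert the x position into a target character column.
            int column = Math.round(c.x / charWidth);
            // Keep at least one space between neighbouring chunks.
            if (line.length() > 0 && column <= line.length()) {
                column = line.length() + 1;
            }
            // Pad with spaces up to the target column, then write the text.
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(c.text);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Made-up positions that mimic the statement layout in the question.
        List<Chunk> header = Arrays.asList(
                new Chunk("DATE", 0f), new Chunk("MODE", 60f),
                new Chunk("PARTICULARS", 90f), new Chunk("DEPOSITS", 162f),
                new Chunk("WITHDRAWALS", 216f), new Chunk("BALANCE", 294f));
        List<Chunk> row = Arrays.asList(
                new Chunk("01-04-2017", 0f), new Chunk("B/F", 90f),
                new Chunk("54,396.82", 294f));

        System.out.println(layout(header, 6f));
        System.out.println(layout(row, 6f));
    }
}
```

With these sample positions, B/F lines up under PARTICULARS and the amount under BALANCE, because each chunk is padded to the column derived from its x position rather than joined with a single space. A fixed character width is only an approximation for proportional fonts, so you will probably need to tune it or derive it from the chunk widths.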

Hugues M.