iText - Can't read the content of a PD4ML generated pdf

Question

I have a problem reading PDF content with iText. I have tried all the different techniques. They all work with standard PDF documents, but I have one PDF document that I need to amend and I can't get its content.

This document has been generated by PD4ML. It can be read in Acrobat Reader, but cannot be read in OpenOffice.

For example, using the command

  PdfReader reader = new PdfReader(src);
  FileOutputStream out = new FileOutputStream(result);
  out.write(reader.getPageContent(1));

Produces this output: q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1 Tf [ <0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>

But when I try to extract the text content, text items are found but nothing is displayed, as if the text format were different.

This code:

    PdfReader reader = new PdfReader(src);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(result));
    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
    }

This just produces spaces. The same happens with LocationTextExtractionStrategy.

The command PdfContentReaderTool.listContentStream(new File(src), out);

Produces ==============Page 1==================== - - - - - Dictionary - - - - - - (/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0, 0, 595.29, 841.89]) Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0, 0, 595.29, 841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R]) Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary) Subdictionary /XObject = (/Im1=Stream of type: /XObject) Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R]) Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font) Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream) Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream) - - - - - XObject Summary - - - - - - ------ /Im1 - subtype = /Image = 9148 bytes ------

- - - - Content Stream - - - - - - q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1

But the Text Extraction part is empty.

Any idea why I can't read the text? Is there something else I could do or test before getting the text?

Any pointer welcome.

Gilles

Show us the PDF. The PDF string values are stored in the content stream using their hexadecimal notation. I see plenty of `00`s in a regular pattern, which leads me to believe that a composite font is used. I read `"071472,3.0,3/` in the first content stream you shared, but maybe the character mapping of the font referred to using `/G1` is all wrong. You do know that bad PDFs can cause the creation of bad output, don't you? — Bruno Lowagie, Nov 27 '15 at 08:23
This pdf is generated by a third party product the company is using. I will try to get a desensitized one and share it. Thanks — Gilles, Nov 27 '15 at 10:16
OK. It turns out that this PDF was corrupted. I can read the content now. I am being asked to replace some text in this file. Is there a way to do it? I can't find it in the book. — Gilles, Nov 27 '15 at 14:32
If you've read the intro of [Chapter 6](https://manning-content.s3.amazonaws.com/download/3/3c9ca46-76da-4de2-8972-b82efbe0bf88/samplechapter6.pdf), you know that the person who asked you to replace some text in a PDF doesn't know the first thing about PDF. Please tell him in kind words that his request reveals a deep lack of understanding of the Portable Document Format. — Bruno Lowagie, Nov 27 '15 at 18:20
Instead of changing the content, is it possible to go through the structure and copy it to another document? The text I need to change is well identified. Could I write the new text to the stream instead of the old one while I am walking through the text structure? — Gilles, Nov 27 '15 at 21:24
Have you taken a look inside the structure? All text is added at absolute positions. Words often can't be recognized as words. If you replace one word with another, the layout will be messed up, unless the width of both words is identical. It is also assumed that you don't introduce characters that aren't known by the font that is "in use". It's tricky business. I strongly advise against it unless you find a way to completely reflow the document. — Bruno Lowagie, Nov 28 '15 at 06:28
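
To illustrate the comment about hexadecimal notation and composite fonts: the listing in the question shows that `/G1` is a `/Type0` font with `/Encoding=/Identity-H`, so each group of four hex digits in a string such as `<004800550049>` is one 2-byte character code. The following is a minimal sketch in plain Java (no iText involved) that splits such a string into its codes. The class name `HexStringCodes` and the method `twoByteCodes` are made up for this illustration. Turning the codes into readable text would still require the font's `/ToUnicode` stream, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;

public class HexStringCodes {

    /**
     * Splits a PDF hex string such as "<004800550049>" into 2-byte character codes.
     * Hypothetical helper for illustration only; it does not map codes to Unicode.
     */
    static List<Integer> twoByteCodes(String hexString) {
        // Drop the angle brackets and any whitespace inside the hex string.
        String hex = hexString.replaceAll("[<>\\s]", "");
        // A hex string with an odd number of digits is read as if a trailing 0 were present.
        if (hex.length() % 2 != 0) {
            hex += "0";
        }
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("Not a whole number of 2-byte codes: " + hexString);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            // Four hex digits form one 2-byte code.
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Two of the strings from the TJ array in the question's content stream.
        System.out.println(twoByteCodes("<0033>"));
        System.out.println(twoByteCodes("<004800550049>"));
    }
}
```

The codes that come out are font-specific character codes, not Unicode values, which is why replacing text means more than swapping bytes: the new codes must exist in the embedded font subset, and their glyph widths must match for the layout to survive.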
