I have an issue reading PDF content with iText. I have tested all the different techniques I could find. They all work with standard PDF documents, but there is one PDF document that I need to amend and I can't get its content.
This document was generated by PD4ML. It can be read in Acrobat Reader, but cannot be read in OpenOffice.
For example, this code:
PdfReader reader = new PdfReader(src);
FileOutputStream out = new FileOutputStream(result);
out.write(reader.getPageContent(1));
Produces this output: q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1 Tf [ <0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>
But when I try to extract the text content, nothing is displayed, even though the content stream contains text items. It is as if the text format were different.
This code:
PdfReader reader = new PdfReader(src);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(result));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    out.println(strategy.getResultantText());
}
This just produces spaces. The same happens with LocationTextExtractionStrategy.
The command PdfContentReaderTool.listContentStream(new File(src), out);
Produces:
==============Page 1====================
- - - - - Dictionary - - - - - -
(/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0, 0, 595.29, 841.89])
Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0, 0, 595.29, 841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R])
Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary)
Subdictionary /XObject = (/Im1=Stream of type: /XObject)
Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R])
Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font)
Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
- - - - - XObject Summary - - - - - -
------ /Im1 - subtype = /Image = 9148 bytes ------
- Content Stream - - - - - -
q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088 775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609 -2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485 cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1
But the Text Extraction part is empty.
Any idea why I can't read the text? Is there something else I could do or test before getting the text?
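One check I can think of, as a rough sketch only: both fonts use /Encoding /Identity-H, so as far as I understand the hex strings in the content stream are 2-byte character codes rather than Unicode, and an extractor would need each font's /ToUnicode stream to turn them into text. The helper below only splits the operands from the page 1 content stream into those codes, so they can be compared by hand against the /ToUnicode mapping. HexTextOperand and decodeCodes are hypothetical names of my own, not iText API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of iText): splits a PDF hex string operand
// into 2-byte character codes, assuming the font uses /Encoding /Identity-H.
class HexTextOperand {

    static List<Integer> decodeCodes(String operand) {
        String hex = operand.trim();
        // Strip the enclosing angle brackets of a PDF hex string, if present.
        if (hex.startsWith("<") && hex.endsWith(">")) {
            hex = hex.substring(1, hex.length() - 1);
        }
        // White space inside a hex string is not significant.
        hex = hex.replaceAll("\\s+", "");
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("Expected whole 2-byte codes: " + operand);
        }
        List<Integer> codes = new ArrayList<>();
        for (int i = 0; i < hex.length(); i += 4) {
            // Each group of four hex digits is one 2-byte code.
            codes.add(Integer.parseInt(hex.substring(i, i + 4), 16));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Operands copied from the text array in the page 1 content stream.
        String[] operands = {
            "<0033>",
            "<004800550049>",
            "<00520055005000440051004600480003>",
            "<0044005100470003>"
        };
        for (String operand : operands) {
            System.out.println(operand + " -> " + decodeCodes(operand));
        }
    }
}
```

If the codes listed here have no matching entries in the /ToUnicode stream of /G1, that could explain why the extraction strategies return only spaces, but this is a guess on my side.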
Any pointers are welcome.
Gilles