Cannot read generated text of pdf file in Java

Question

I am trying to extract the text of a PDF file in Java, but it isn't working. Here is my code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File pdfFile = new File("1.pdf");
PDFParser parser = new PDFParser(new RandomAccessFile(pdfFile,"rw"));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(5);
String parsedText = pdfStripper.getText(pdDoc);
System.out.println(parsedText);

But the result looks like this:

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download.

For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.

Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

I found that this happens because the file is an XFA PDF document, but I don't know anything about the XFA form inside my PDF. Please let me know how I can inspect it.

Someone help me please. Thank you!

Can you share the PDF file that is generated from your program? — Adi Ohana, Apr 21 '19 at 15:00
Sorry, but the PDF contains some of my business info, so I can't share it. It contains input fields, buttons, and checkboxes. Do you have any idea about this problem? It displays correctly when I open it with Adobe Reader. — Shing Ho Tan, Apr 21 '19 at 15:05
Thank you. Yeah, it is an XFA form. I think PDFBox supports XFA forms: https://stackoverflow.com/questions/10536334/combining-xfa-with-pdfbox However, with my code it's not working... — Shing Ho Tan, Apr 21 '19 at 15:52
I want to see the XML tags or components of this PDF. Is there any way? — Shing Ho Tan, Apr 22 '19 at 00:56
You can inspect PDF documents with PDFDebugger. Click on "view", "show internal structure", then go to "Root/AcroForm". — Tilman Hausherr, Apr 22 '19 at 10:53
Thank you, Tilman. Where is the PDFDebugger, and how can I use it? — Shing Ho Tan, Apr 22 '19 at 10:54
On the PDFBox download page. https://pdfbox.apache.org/download.cgi — Tilman Hausherr, Apr 22 '19 at 11:01
Yeah, I used it, but it shows the "Please wait..." page. It seems it doesn't show the XFA document. — Shing Ho Tan, Apr 22 '19 at 11:03
You need to click on "view", "show internal structure" in the PDFDebugger menu at the top of the window, then go to "Root/AcroForm". What you did was show the page. What I told you was to inspect the structures in the left pane (after having switched to show the internal structure). — Tilman Hausherr, Apr 22 '19 at 11:13
Thank you. I found it. I have one more question. My XFA PDF contains a date, like "issue date: 03/21/2019", but I cannot find it in the PDFDebugger left pane under Root/AcroForm/XFA. — Shing Ho Tan, Apr 22 '19 at 11:18

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

To sum up what has been said or hinted at in the comments...

The text quoted by the OP,

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

...

is the content of the single PDF page Adobe software commonly puts into PDFs with a pure XFA form.

XFA forms constitute an alternative way to describe forms in PDFs. In contrast to the AcroForm way, XFA forms use the PDF only as an envelope carrying an XML stream that describes the properties, behavior, and values of the form in a way unrelated to any other PDF structure.

Thus, many PDF processors offer only rudimentary support for XFA forms (or none at all), the main exception being (obviously) Adobe products.

As a result, XFA has been marked as deprecated in the current PDF specification, ISO 32000-2.

In the case of PDFBox, XFA support is restricted to retrieving the XFA XML data. Text extraction using the PDFTextStripper and related classes only operates on the regular PDF page content and, therefore, only retrieves the placeholder text reported by the OP.

To access the content of XFA forms, you can retrieve the XFA resource using PDAcroForm.getXFA().
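
As a rough illustration of what can be done with that XML once it has been retrieved, here is a minimal sketch that uses only the JDK's DOM parser. It assumes the filled-in values sit in the XFA "datasets" packet (namespace http://www.xfa.org/schema/xfa-data/1.0/), which is where form data is normally stored. The class name XfaDataDump, the sample XML, and the field names IssueDate and Approved are hypothetical; the PDFBox calls shown in the comment assume PDFBox 2.x and are not exercised by the sketch itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class XfaDataDump {

    // Namespace of the XFA "datasets" packet that holds the form values.
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    // Hypothetical, heavily trimmed XFA stream. A real one also carries
    // template, config and other packets, and its field names will differ.
    static final String SAMPLE =
        "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
      + "<xfa:datasets xmlns:xfa=\"" + XFA_DATA_NS + "\">"
      + "<xfa:data><form1>"
      + "<IssueDate>03/21/2019</IssueDate>"
      + "<Approved>1</Approved>"
      + "</form1></xfa:data>"
      + "</xfa:datasets></xdp:xdp>";

    // With PDFBox 2.x on the classpath, the bytes would come from something like:
    //   try (PDDocument doc = PDDocument.load(new File("1.pdf"))) {
    //       PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
    //       byte[] xfaXml = form.getXFA().getBytes();
    //   }
    static Map<String, String> extractValues(byte[] xfaXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xfaXml));
            Map<String, String> values = new LinkedHashMap<>();
            NodeList dataNodes = doc.getElementsByTagNameNS(XFA_DATA_NS, "data");
            for (int i = 0; i < dataNodes.getLength(); i++) {
                collectLeaves((Element) dataNodes.item(i), values);
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XFA XML", e);
        }
    }

    // Walks the subtree and records every element that has no child elements.
    private static void collectLeaves(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collectLeaves((Element) child, values);
            }
        }
        if (!hasChildElement) {
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        extractValues(SAMPLE.getBytes(StandardCharsets.UTF_8))
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note that the sketch keys values by element name only, so repeated field names in different subforms would overwrite each other; a real form would probably need the full element path as the key. If a value such as an issue date does not show up in the datasets packet, it may instead be static text defined in the template packet.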
