I am trying to extract the text of a PDF in Java with PDFBox, but it isn't working as expected. Here is my code:

// Open the file read-only; text extraction does not need write access.
File pdfFile = new File("1.pdf");
PDFParser parser = new PDFParser(new RandomAccessFile(pdfFile, "r"));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDDocument pdDoc = new PDDocument(cosDoc);
try {
    PDFTextStripper pdfStripper = new PDFTextStripper();
    // Extract text from pages 1 through 5 (page numbers are 1-based).
    pdfStripper.setStartPage(1);
    pdfStripper.setEndPage(5);
    String parsedText = pdfStripper.getText(pdDoc);
    System.out.println(parsedText);
} finally {
    // Closing the PDDocument also releases the underlying COSDocument.
    pdDoc.close();
}

But the result looks like this:

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

You can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by visiting http://www.adobe.com/go/reader_download.

For more assistance with Adobe Reader visit http://www.adobe.com/go/acrreader.

Windows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark of Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.

I found out that this happens because the PDF is an XFA document, but I don't know anything about the XFA format of my PDF. Please let me know how I can find out more about the XFA format.

Could someone help me, please? Thank you!

Shing Ho Tan

1 Answer

To sum up what has been said or hinted at in the comments...

The text quoted by the OP,

Please wait...

If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.

...

is the content of the single PDF page Adobe software commonly puts into PDFs with a pure XFA form.

XFA forms constitute an alternative way to describe forms in PDFs. In contrast to the AcroForm way, XFA forms use the PDF only as an envelope carrying an XML stream that describes the properties, behavior, and values of the form in a way unrelated to any other PDF structure.

Thus, many PDF processors offer only rudimentary support for XFA forms (or none at all), the main exception being (obviously) Adobe products.

As a result, XFA has been marked deprecated in the current PDF specification, ISO 32000-2.


In the case of PDFBox, XFA support is limited to retrieving the XFA XML data. Text extraction using the PDFTextStripper and related classes operates only on the regular PDF content and, therefore, retrieves only the text reported by the OP.

To access the content of XFA forms, you can retrieve the XFA resource using PDAcroForm.getXFA().
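
As a rough sketch of what you might do with that resource once you have it: in PDFBox 2.x, document.getDocumentCatalog().getAcroForm().getXFA() returns a PDXFAResource (or null if the form has no XFA entry, which is also a simple way to check whether a PDF carries XFA at all), and its getBytes() method gives you the raw XML. The class below is not part of PDFBox; XfaDataReader and leafValues are names made up for this illustration, and it assumes the form values sit under a "datasets" element, which is the usual XFA layout but may differ for your file. It uses only the JDK XML parser, and the sample XML in main is hand-written, not taken from the OP's document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a PDFBox class): lists the filled-in values found
// in XFA XML, e.g. the bytes obtained from PDXFAResource.getBytes().
final class XfaDataReader {

    /** Returns "name=value" for every non-empty leaf element below any "datasets" element. */
    static List<String> leafValues(byte[] xfaXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // The XML comes from an arbitrary PDF, so refuse DOCTYPE declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xfaXml));

        List<String> values = new ArrayList<>();
        // Match on the local name only; the namespace prefix varies between files.
        NodeList datasets = doc.getElementsByTagNameNS("*", "datasets");
        for (int i = 0; i < datasets.getLength(); i++) {
            collect((Element) datasets.item(i), values);
        }
        return values;
    }

    // Depth-first walk that records elements which have text but no child elements.
    private static void collect(Element element, List<String> out) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasChildElement = true;
                collect((Element) child, out);
            }
        }
        if (!hasChildElement) {
            String text = element.getTextContent().trim();
            if (!text.isEmpty()) {
                out.add(element.getLocalName() + "=" + text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Hand-written sample standing in for real XFA XML.
        String sample =
            "<xdp:xdp xmlns:xdp=\"http://ns.adobe.com/xdp/\">"
          + "<xfa:datasets xmlns:xfa=\"http://www.adobe.com/xfa/schema/data/1.0/\">"
          + "<xfa:data><form1><Name>Jane</Name><City>Oslo</City></form1></xfa:data>"
          + "</xfa:datasets></xdp:xdp>";
        for (String entry : leafValues(sample.getBytes(StandardCharsets.UTF_8))) {
            System.out.println(entry);
        }
    }
}
```

Note that this only recovers the data values, not the labels and static text of the form; those live in the template part of the XFA XML, which would need its own handling.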

mkl