I´m generating PDFs using iText and it works fine. But I need a way to import html styled informations from an existing PDF at some point. I know i could just use the XMLWorker class to generate the text directly from html in my own document. But cause I´m not sure whether it actually supports all html features I´m looking to work around this. Therefore a PDF is generated from html using XSLT. The content of this PDF then should be copied to my document. There are two ways discribed in the book ("iText in Action"). One that parses the PDF and gets you the text (or other informations) from the document using PdfReaderContentParser and TextExtractionStrategy. It looks like this:
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for(int i=1;i<=reader.getNumberOfPages();i++){
strategy = parser.processContent(i, new LocationTextExtractionStrategy());
document.add(new Chunk(strategy.getResultantText()));
}
But this only prints plain text to the document. Obviously there are more ExtractionStrategys and maybe one of them does exactly what i want but i couldn´t find it yet.
The second way is to copy an itextpdf.text.Image of each side of the PDF to your document. This is obviously not a good idea, cause it will add the entire page to your document even if there is only one line of text in the existing PDF. Its done like this:
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
PdfReader reader = new PdfReader(pdf);
PdfImportedPage page;
for(int i=1;i<=reader.getNumberOfPages();i++){
page = writer.getImportedPage(reader,i);
document.add(Image.getInstance(page));
}
Like I said this copys all the empty lines at the end of the PDF aswell, but i need to continue my text immediatly after the last line of text. If I could convert this itext.text.Image into a java.awt.BufferedImage I could use getSubImage(); and informations i can extract from the PDF to cut away all the empty lines. But i wasn´t able to find a way to to this.
This are the two ways i found. But cause none of them is suitable for my purpose as they are my question is: Is there a way to import everything except the empty lines at the end, but including text-style informations, tables and everything else from a PDF to my document using iText?