
I'm generating PDFs using iText and it works fine. But at some point I need a way to import HTML-styled information from an existing PDF. I know I could just use the XMLWorker class to generate the text directly from HTML in my own document. But because I'm not sure whether it actually supports all the HTML features I need, I'm looking for a way around it. Therefore a PDF is generated from the HTML using XSLT. The content of this PDF should then be copied to my document. There are two ways described in the book ("iText in Action"). The first parses the PDF and gets you the text (or other information) from the document using PdfReaderContentParser and TextExtractionStrategy. It looks like this:

PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new LocationTextExtractionStrategy());
    document.add(new Chunk(strategy.getResultantText()));
}

But this only prints plain text to the document. There are other extraction strategies, and maybe one of them does exactly what I want, but I haven't found it yet.

The second way is to copy a com.itextpdf.text.Image of each page of the PDF to your document. This is not a good idea in my case, because it adds the entire page to the document even if there is only one line of text in the existing PDF. It's done like this:

PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
PdfReader reader = new PdfReader(pdf);
PdfImportedPage page;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    page = writer.getImportedPage(reader, i);
    document.add(Image.getInstance(page));
}

As I said, this also copies all the empty lines at the end of the PDF, but I need to continue my text immediately after the last line of text. If I could convert this com.itextpdf.text.Image into a java.awt.image.BufferedImage, I could use getSubimage() together with information I can extract from the PDF to cut away all the empty lines. But I wasn't able to find a way to do this.
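The trimming step of that idea can be sketched with the standard library alone. This is only a minimal sketch, assuming the page has already been rendered to a BufferedImage by some other tool; the class BottomWhitespaceTrimmer and its methods are made-up names for illustration, and every pixel that is not pure white is treated as content. Note that the AWT method is spelled getSubimage.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of iText: trims blank (pure white) rows
// from the bottom of an already rasterized page image.
final class BottomWhitespaceTrimmer {

    /** Returns the index of the lowest row containing a non-white pixel, or -1 if the image is blank. */
    static int lastContentRow(BufferedImage image) {
        for (int y = image.getHeight() - 1; y >= 0; y--) {
            for (int x = 0; x < image.getWidth(); x++) {
                // Compare the RGB part only; the alpha channel is ignored
                if ((image.getRGB(x, y) & 0xFFFFFF) != 0xFFFFFF) {
                    return y;
                }
            }
        }
        return -1;
    }

    /** Returns a view of the image without the blank rows at the bottom. */
    static BufferedImage trimBottom(BufferedImage image) {
        int last = lastContentRow(image);
        if (last < 0) {
            return image; // completely blank: nothing sensible to crop to
        }
        // getSubimage shares the pixel data with the original image
        return image.getSubimage(0, 0, image.getWidth(), last + 1);
    }
}
```

Working on a raster image discards the vector text and fonts of the page, so this illustrates the cropping arithmetic rather than a way to keep the original styling.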

These are the two ways I found. But because neither of them suits my purpose as it stands, my question is: Is there a way to import everything except the empty lines at the end, but including text style information, tables and everything else, from a PDF into my document using iText?

moli
  • As soon as you generate the PDF using XSLT, you lose all semantic information (e.g. which letters form a text line, which lines form a paragraph, which form a column, where the next line should start, ...). Thus, your approach in my opinion is leading down a blind alley. That being said, though, you could trim away the empty space of the XSLT-generated PDF using a `PdfStamper` and then import the trimmed pages as in your code. Look at e.g. [Using iTextPDF to trim a page's whitespace](http://stackoverflow.com/a/20212172/1729265); that answer uses iText/Java but should be adaptable to iTextSharp/C#. – mkl Aug 13 '15 at 10:31
  • @mkl This actually looks pretty good. But I found no way to add the resulting whitespace-free PDF to my document. Adding it using my second code snippet inserts an entire page including the whitespace. I can't see a way to add it to the document directly, because I can't wrap it inside an Image or Element. Using the same stream for PdfWriter and PdfStamper seems to result in only the stamper writing to the stream. – moli Aug 13 '15 at 12:43
  • Can you share a sample XSLT output for inspection and demonstration purposes? – mkl Aug 13 '15 at 14:44
  • Unfortunately I can't provide a sample. But any PDF with whitespace at the end that you have at hand will work as an example. – moli Aug 14 '15 at 05:59

1 Answer


You can trim away empty space of the XSLT generated PDF and then import the trimmed pages as in your code.

Example code

The following code borrows from the code in my answer to [Using iTextPDF to trim a page's whitespace](http://stackoverflow.com/a/20212172/1729265). In contrast to the code there, though, we have to manipulate the media box, not the crop box, because this is the only box respected by PdfWriter.getImportedPage.

Before importing a page from a given PdfReader, crop it using this method:

static void cropPdf(PdfReader reader) throws IOException
{
    // One parser can process every page of the reader
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    int n = reader.getNumberOfPages();
    for (int i = 1; i <= n; i++)
    {
        // Determine the bounding box of the content actually drawn on page i
        MarginFinder finder = parser.processContent(i, new MarginFinder());
        Rectangle rect = new Rectangle(finder.getLlx(), finder.getLly(), finder.getUrx(), finder.getUry());

        // Shrink the media box of the page (as held in the reader) to that bounding box
        PdfDictionary page = reader.getPageN(i);
        page.put(PdfName.MEDIABOX, new PdfArray(new float[]{rect.getLeft(), rect.getBottom(), rect.getRight(), rect.getTop()}));
    }
}

(excerpt from ImportPageWithoutFreeSpace.java)

The extended render listener MarginFinder is taken as is from the question linked to above. You can find a copy here: MarginFinder.java.
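To illustrate the idea behind such a margin finder without depending on iText, here is a minimal sketch of the bounding-box accumulation only. It is not the actual MarginFinder: the class BoundingBoxAccumulator is a made-up name, and it assumes that some listener reports the rectangle of each rendered text chunk or image in page coordinates. The getter names merely mirror the ones used in cropPdf.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical illustration, not the real MarginFinder: keeps the union of
// all content rectangles reported for one page.
final class BoundingBoxAccumulator {

    private Rectangle2D.Float bounds; // null until the first rectangle arrives

    /** Grows the bounding box so that it also covers the given rectangle. */
    void include(Rectangle2D.Float rect) {
        if (bounds == null) {
            bounds = new Rectangle2D.Float(rect.x, rect.y, rect.width, rect.height);
        } else {
            bounds.add(rect); // union of both rectangles
        }
    }

    boolean hasContent() {
        return bounds != null;
    }

    float getLlx() { return (float) required().getMinX(); }
    float getLly() { return (float) required().getMinY(); }
    float getUrx() { return (float) required().getMaxX(); }
    float getUry() { return (float) required().getMaxY(); }

    private Rectangle2D.Float required() {
        if (bounds == null) {
            throw new IllegalStateException("no content was reported for this page");
        }
        return bounds;
    }
}
```

The four getters then yield the lower-left and upper-right corners that go into the new media box array. An empty page needs separate handling, which this sketch signals with an exception.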

Example run

Using this code

PdfReader readerText = new PdfReader(docText);
cropPdf(readerText);
PdfReader readerGraphics = new PdfReader(docGraphics);
cropPdf(readerGraphics);
try (   FileOutputStream fos = new FileOutputStream(new File(RESULT_FOLDER, "importPages.pdf")))
{
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, fos);
    document.open();
    document.add(new Paragraph("Let's import 'textOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerText, 1)));
    document.add(new Paragraph("and now 'graphicsOnly.pdf'", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));
    document.add(Image.getInstance(writer.getImportedPage(readerGraphics, 1)));
    document.add(new Paragraph("That's all, folks!", new Font(FontFamily.HELVETICA, 12, Font.BOLD)));

    document.close();
}
finally
{
    readerText.close();
    readerGraphics.close();
}

(excerpt from unit test method testImportPages in ImportPageWithoutFreeSpace.java)

I imported both the page from the docText document

[screenshot: docText document]

and the page from the docGraphics document

[screenshot: docGraphics document]

into a new document with some text before, between, and after. The result:

[screenshot: result of the import]

As you can see, source styles are preserved but free space around is discarded.

mkl