I generate huge catalogs (~1500 pages) as HTML and convert them to PDF via Jsoup and openhtmltopdf (which uses Flying Saucer). In the resulting PDF many links are not clickable, and I can't find out why.

Consider the following program:

import com.openhtmltopdf.pdfboxout.PdfRendererBuilder;
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;

public class Main {

    public static void main(String[] args) throws Exception {

        // Build a test document with 10,000 links that differ only in the query value.
        StringBuilder html = new StringBuilder("<html><head></head><body>");
        for (int i = 0; i < 10000; i++) {
            html.append("<a href='http://www.google.de?q=").append(i).append("'>blabla</a>    <br>");
        }
        html.append("</body></html>");

        // Parse with Jsoup and convert to a W3C DOM, which openhtmltopdf accepts.
        Document w3cDoc = new W3CDom().fromJsoup(Jsoup.parse(html.toString()));

        // try-with-resources makes sure the output stream is flushed and closed.
        try (OutputStream out = new FileOutputStream(new File("/tmp/tmp.pdf"))) {
            PdfRendererBuilder pdfBuilder = new PdfRendererBuilder();
            pdfBuilder.withW3cDocument(w3cDoc, "/");
            pdfBuilder.toStream(out);
            pdfBuilder.run();
        }
    }
}

It creates a PDF with 176 pages and 10,000 links. On pages 1 to 3 the links are clickable; on later pages they are not, although the markup is identical. The last clickable link is number 112, and in the PDF source I find:

870 0 obj
<<
/W 0.0
/S /S
>>
endobj
871 0 obj
<<
/S /URI
/URI (http://www.google.de?q=111)
>>
endobj
872 0 obj
<<
/W 0.0
/S /S
>>
endobj
873 0 obj
<<
/S /URI
/URI (http://www.google.de?q=112)
>>
endobj
874 0 obj
<<
/W 0.0
/S /S
>>
endobj
875 0 obj
<<
/F1 1049 0 R
>>
endobj

Apparently after number 112 there are no URLs stored anymore in the annotation objects.
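
To check this without scrolling through the file by hand, a small stdlib-only scan along these lines could count the URI entries. This is only a rough sketch: it assumes the PDF objects are stored uncompressed (as in the excerpt above) and that the URLs contain no parentheses. The class name `UriActionCounter` and the method `findUris` are made up for illustration and are not part of openhtmltopdf or PDFBox.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw PDF text for uncompressed URI action entries.
public class UriActionCounter {

    // Matches entries such as: /URI (http://www.google.de?q=112)
    // Assumption: the URL itself contains no ')' character.
    private static final Pattern URI_ENTRY = Pattern.compile("/URI\\s*\\(([^)]*)\\)");

    // Returns every URL found in a "/URI (...)" entry, in file order.
    public static List<String> findUris(String pdfText) {
        List<String> uris = new ArrayList<>();
        Matcher matcher = URI_ENTRY.matcher(pdfText);
        while (matcher.find()) {
            uris.add(matcher.group(1));
        }
        return uris;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/tmp/tmp.pdf";
        // ISO-8859-1 maps every byte to exactly one char, so binary parts do not break decoding.
        String pdfText = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.ISO_8859_1);
        List<String> uris = findUris(pdfText);
        System.out.println(uris.size() + " URI entries found");
        if (!uris.isEmpty()) {
            System.out.println("Last one: " + uris.get(uris.size() - 1));
        }
    }
}
```

If the count is far below the number of links in the HTML, the URI actions were never written. If the PDF uses compressed object streams, this scan will miss entries and a real PDF parser would be needed instead.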

My real program is naturally much more complicated. On the first five or six pages of its output all the links are clickable; after that some are, but most are not. Which ones remain clickable seems to be completely random.

Can anyone help here? Any idea what may cause this issue or how to fix it? Is it a bug in openhtmltopdf?

--

edit 1: Using withHtmlContent instead of withW3cDocument has the same problem.

Omid Nazifi
Paflow
  • Just tried to reproduce your issue, but it works fine for me. With your exact code, the PDF generated has 176 pages, with 10k clickable links. I used `jsoup 1.11.2` and `openhtmltopdf-pdfbox-0.0.1-RC11`. – obourgain Nov 27 '17 at 16:47
  • Same for me, works perfectly with `openhtmltopdf-pdfbox-0.0.1-RC12`. It seems you need to share the versions of the libraries you use. – Babl Nov 28 '17 at 01:44
  • Okay, with an update the problem is gone. I used RC8 and 1.10.2 before. What do I do with my bounty now? – Paflow Nov 29 '17 at 14:09
  • I'd suggest giving it to the poster who put the effort in to reproduce your issue initially. @obourgain - consider answering the question to claim the bounty? – Phil Dec 04 '17 at 09:15

1 Answer

The generated PDF works perfectly with jsoup 1.11.2 and openhtmltopdf-pdfbox-0.0.1-RC11.

The problem is likely caused by a bug in an older version of openhtmltopdf, which has been fixed.
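
Since the behaviour depends on the library versions, it may help to print which versions are actually on the classpath before comparing results. The following is a minimal sketch, assuming the jars declare an `Implementation-Version` in their manifest (not every jar does, in which case it prints "unknown"). `LibraryVersions` and `versionOf` are hypothetical names used only for this example.

```java
// Hypothetical helper: reports the manifest version of the jar a class was loaded from.
public class LibraryVersions {

    // Returns the Implementation-Version of the class's package, or "unknown" if none is declared.
    public static String versionOf(Class<?> type) {
        Package pkg = type.getPackage();
        String version = (pkg == null) ? null : pkg.getImplementationVersion();
        return version != null ? version : "unknown";
    }

    public static void main(String[] args) {
        // Class names taken from the imports in the question; pass others as arguments if needed.
        String[] classNames = args.length > 0
                ? args
                : new String[] {"org.jsoup.Jsoup", "com.openhtmltopdf.pdfboxout.PdfRendererBuilder"};
        for (String className : classNames) {
            try {
                System.out.println(className + " -> " + versionOf(Class.forName(className)));
            } catch (ClassNotFoundException e) {
                System.out.println(className + " -> not on the classpath");
            }
        }
    }
}
```

Looking at the resolved dependency versions in the build tool is the more reliable check; this only confirms what the running JVM has loaded.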

obourgain