
I use ImageMagick to render a PDF (generated by pdfLaTeX) as an image:

convert -density 120 test.pdf -trim test.png

Then I use this image in an HTML file (in order to include LaTeX code in my own wiki engine).

But of course, the PNG file doesn't retain any of the hyperlinks the PDF file contains.

Is there any way to extract the coordinates and target URLs of the hyperlinks too, so that I can build an HTML image map?

If it makes a difference: I only need external (http://) hyperlinks, no PDF-internal hyperlinks. A text-based solution like pdftohtml would be unacceptable, since the PDFs contain graphics and formulas too.

leemes
  • I've got a pretty similar case. I receive PDF files that may contain hyperlinks that are clickable (and open a webpage) when viewing the file in a PDF viewer like Acrobat Reader or Evince. I use Ghostscript to convert the PDF contents to bitmap images for later (pre)viewing in a webapp. I want to show the hyperlinks and their respective hotspots overlaid on the image when showing it in my webapp. For that I'd need to extract the link URLs and the hotspot rectangles from the PDF. – Tero Tilus Sep 28 '14 at 13:45

2 Answers

2

ImageMagick uses Ghostscript to render the PDF file to an image. You could also use Ghostscript to extract the Link annotations. In fact, the PDF interpreter already does this for the benefit of the pdfwrite device, so that it can produce PDF files with the same hyperlinks as the original.

You would need to do a small amount of PostScript programming; let me know if you want some more details.

In gs/Resource/Init the file pdf_main.ps contains large parts of the PDF interpreter. In there you will find this:

  /Link {
    mark exch
    dup /BS knownoget { << exch { oforce } forall >> /BS exch 3 -1 roll } if
    dup /F knownoget { /F exch 3 -1 roll } if
    dup /C knownoget { /Color exch 3 -1 roll } if
    dup /Rect knownoget { /Rect exch 3 -1 roll } if
    dup /Border knownoget {
....
    } if
    { linkdest } stopped 

That code processes Link annotations (the hyperlinks in the PDF file). You could replace the 'linkdest' with PostScript code to write the data to a file instead, which would give you the hyperlinks. Note that you would also need to set -dDOPDFMARKS on the command line, as this kind of processing is usually disabled for rendering devices, which can't make use of it.
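As a rough sketch of how such a run could be launched, the helper below only assembles a Ghostscript command line; it does not contain the PostScript modification itself. It assumes a Ghostscript build that still uses the PostScript-based PDF interpreter (pdf_main.ps), that the executable is called `gs` (on Windows it is typically `gswin64c`), and that `init_dir` points to a directory holding the modified copy of pdf_main.ps. The function name and parameters are hypothetical.

```python
def ghostscript_command(pdf_path, png_path, init_dir, dpi=120, gs_exe="gs"):
    """Build (but do not run) a Ghostscript command line that renders a PDF
    to PNG with pdfmark processing enabled.

    init_dir is assumed to contain the modified pdf_main.ps; it is passed
    with -I so that Ghostscript searches that directory for its init files.
    """
    return [
        gs_exe,
        "-dBATCH",             # exit after processing the input file
        "-dNOPAUSE",           # do not wait for a key press between pages
        "-sDEVICE=png16m",     # 24-bit RGB PNG rendering device
        "-r%d" % dpi,          # resolution, matching "convert -density"
        "-dDOPDFMARKS",        # enable annotation/pdfmark processing
        "-I" + init_dir,       # search path for the modified resource files
        "-sOutputFile=" + png_path,
        pdf_path,
    ]


if __name__ == "__main__":
    # To actually run it: subprocess.run(cmd, check=True)
    cmd = ghostscript_command("test.pdf", "test.png", "/path/to/modified/Init")
    print(" ".join(cmd))
```

Where the replacement for 'linkdest' writes its output is up to the PostScript code you put in the modified file; this helper makes no assumption about that.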

KenS
  • Thank you. Sounds like a bit of work. Do you have any references which help me writing such a program? If it helps, I can also output a PS file because my source is LaTeX code. – leemes May 15 '12 at 11:08
  • It's all Ghostscript-specific, so no references really. It would be all tied up with the way the GS PDF interpreter works. I've edited my answer to add a few details. – KenS May 15 '12 at 14:55
  • @leemes I have a similar requirement. If you completed this, please paste the code. – Pearl Nov 28 '13 at 07:34
  • @Pearl I'm sorry, I don't have any news on this. I didn't implement it because the (estimated) effort seems to be too much compared to what I'd get in the end. :( But if *you* have news on this, please let me know. Thanks. :) – leemes Nov 28 '13 at 10:47
  • I'm also interested in getting the links from a PDF file. I already figured out how to filter them from a plain txt file created by Ghostscript, but the links that the thumbnails point to were not in it. Is it possible to access them? I don't understand your code above. On my server there is only a ghostscript folder, but no gs folder with resource files. Could you please explain how to access the PDF links? – user2718671 Feb 07 '14 at 11:46
  • I assume you are on Windows. You would need to get the GS source from the Git repository in order to get copies of the Resource files. Then modify them, and point GS to the modified resource files by using the -I switch. – KenS Feb 07 '14 at 14:54
  • Any pointers to relevant PS resources? Would like to give this a shot (if anybody hasn't yet). Haven't done anything with PS before. – Tero Tilus Sep 28 '14 at 13:49
  • Well, there's the PostScript Language Reference Manual (https://www.adobe.com/products/postscript/pdfs/PLRM.pdf), the Blue Book (http://www-cdf.fnal.gov/offline/PostScript/BLUEBOOK.PDF) and the Green Book (http://www-cdf.fnal.gov/offline/PostScript/GREENBK.PDF). – KenS Sep 28 '14 at 18:14
  • Thanks @KenS. This time I ended up going another way. See my answer. Although I absolutely need to learn PS some day. – Tero Tilus Sep 29 '14 at 06:50
0

A colleague of mine found a nice library, PDFMiner, which includes tools/dumppdf.py; it does pretty much what I need. See http://www.unixuser.org/~euske/python/pdfminer/

There's also another SO question that has an answer for this one; see "Looking for a linux PDF library to extract annotations and images from a PDF". Apparently pdf-reader for Ruby does this too: https://github.com/yob/pdf-reader
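Once the link rectangles and target URLs have been extracted by one of these tools, turning them into an HTML image map is mostly a coordinate conversion. The following is a minimal sketch of that step, not part of PDFMiner or pdf-reader. It assumes each link is given as a PDF /Rect in points (1/72 inch, origin at the bottom-left of an unrotated page whose MediaBox starts at 0,0) together with its URI, and that the image was rendered at 120 dpi as in the question. The `trim_offset` parameter is a hypothetical stand-in for the pixel offset removed by `-trim`, which has to be determined separately.

```python
from html import escape


def pdf_rect_to_pixels(rect, page_height_pt, dpi=120, trim_offset=(0, 0)):
    """Convert a PDF /Rect [x0, y0, x1, y1] in points (origin bottom-left)
    to (left, top, right, bottom) in pixels (origin top-left)."""
    x0, y0, x1, y1 = rect
    # Scale points to pixels; sort so the result does not depend on corner order.
    left, right = sorted((x0 * dpi / 72.0, x1 * dpi / 72.0))
    # Flip the y axis: PDF y grows upwards, image y grows downwards.
    top, bottom = sorted(((page_height_pt - y0) * dpi / 72.0,
                          (page_height_pt - y1) * dpi / 72.0))
    off_x, off_y = trim_offset
    return (round(left - off_x), round(top - off_y),
            round(right - off_x), round(bottom - off_y))


def build_image_map(links, page_height_pt, name="pdflinks", dpi=120,
                    trim_offset=(0, 0)):
    """Build an HTML <map> from (rect, uri) pairs, keeping only external
    http(s) links and skipping everything else."""
    lines = ['<map name="%s">' % escape(name, quote=True)]
    for rect, uri in links:
        if not uri.lower().startswith(("http://", "https://")):
            continue
        coords = ",".join(
            str(v) for v in pdf_rect_to_pixels(rect, page_height_pt, dpi, trim_offset)
        )
        lines.append('  <area shape="rect" coords="%s" href="%s">'
                     % (coords, escape(uri, quote=True)))
    lines.append("</map>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical link on a US Letter page (792 pt high).
    print(build_image_map([([72, 700, 144, 720], "http://example.com/")], 792))
```

The resulting map can then be attached to the rendered image with `<img src="test.png" usemap="#pdflinks">`.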

Tero Tilus