
I have several PDFs that were generated with Microsoft Word. I want to:

  1. Use a regex to find matches in the PDF text.
  2. Convert the matching text to a link that points to an external URL.
  3. Save the new version of the PDF.

If I were doing this in HTML, it would look like this:

<!-- before: -->
This is the text to match.

<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.

How can I do this to a PDF?

I'd prefer Python, but I'm open to alternatives.

Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).

Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink(), but that adds internal links in the PDF, not links to external URLs.

Joe Mornin
  • It may be a better idea to do this in the original Word document. For instance, your first point "Use a regex to find matches in the PDF text" is already not really suited to operate on a PDF. – Jongware Mar 01 '15 at 18:00
  • I don't have access to the original Word documents. I only have the PDFs. – Joe Mornin Mar 01 '15 at 18:01
  • I don't "get" why some SO users would vote this question down and vote to even *close* it. Because they themselves do not know an answer?!? The potential answers for this problems are not likely to create controversy or negative effects for this platform. So why? – Kurt Pfeifle Mar 04 '15 at 17:28
  • Links in PDF are annotations. If solution from 1.5 years ago works i.e. adds highlight annotations where you want links, then that code requires only very minor modification (though I'd re-write it cleaner as I'm looking at it now, but it's another story) and, really, not much effort from you. How did you plan to use *Python PDF library* without opening PDF Reference? – user2846289 Mar 08 '15 at 19:20
  • What if you did this using OCR at least for adding internal links to unlinked PDF document? – MathCrackExchange Oct 04 '22 at 04:19

3 Answers


1. 'regex' approach won't work!

What you want (using a regex to find matches in a PDF) is not possible in the general case. Plain and simple answer.

Reasons:

For the general case, you cannot use regexes in order to find 'matches' in a PDF text. And I will not even talk about Unicode characters here...

I'll only consider the simple string of text from the example in your question: match.

In PDF source code, this string could be present in different incarnations, depending on the PDF generating software as well as on the exact font with font encoding being used. The following listing is not complete!

(match) Tj                       # you are lucky
<6d61746368> Tj                  # hex representation of characters
<6d 61 74 63 68> Tj              # hex representation of characters, v2
<6d   61 7463   68> Tj           # hex representation of characters, v3
<6d>Tj <61>   Tj<746368>Tj       # hex representation of characters, v4
....                             # skipping version 5-500000000 of all...
                                 # ...possible hex representations
(\155\141\164\143\150) Tj        # octal representation of characters
(m\141\164ch) Tj                 # octal/ascii mixed representation of chars
(\155a\164ch) Tj                 # octal/ascii mixed representation of chars, v2
<6d 61>Tj (\164c\150) Tj         # hex/octal/ascii mix
....                             # skipping many more possibilities
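
To see why variants like these all display the same word, here is a minimal Python sketch that decodes a hex string and a literal string with octal escapes back to text. It assumes the simplest case, where the byte values are plain ASCII codes (so "match" is 6d 61 74 63 68). The helper names decode_hex_string and decode_literal_string are made up for this illustration, and only octal escapes are handled. With a custom-encoded subset font the bytes would not be ASCII codes at all, and a /ToUnicode mapping would be needed instead.

```python
import re


def decode_hex_string(token):
    """Decode a PDF hex string such as '<6d 61 74 63 68>'; whitespace is ignored."""
    digits = re.sub(r'\s+', '', token.strip()[1:-1])
    if len(digits) % 2:
        # An odd number of digits means the final digit is assumed to be 0.
        digits += '0'
    return bytes.fromhex(digits).decode('latin-1')


def decode_literal_string(token):
    """Decode a PDF literal string such as '(m\\141\\164ch)', octal escapes only."""
    body = token.strip()[1:-1]
    return re.sub(r'\\([0-7]{1,3})', lambda m: chr(int(m.group(1), 8)), body)


if __name__ == '__main__':
    for raw in ('<6d61746368>', '<6d   61 7463   68>'):
        print(raw, '->', decode_hex_string(raw))
    for raw in (r'(\155\141\164\143\150)', r'(m\141\164ch)', r'(\155a\164ch)'):
        print(raw, '->', decode_literal_string(raw))
```

A regex applied to the raw page source would have to anticipate every one of these spellings, plus the case where one word is split across several Tj operators, which is why matching on extracted text is the more practical route.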

It gets even more complicated if the font used for the string has a custom encoding (as is the case when the font is embedded in the PDF as a subset, containing only those glyphs that are used in the respective text).

This could mean that what was <6d61746368> Tj above could become <2234567111> Tj with the custom encoded font, but it will still display match on the PDF page.


2. Workarounds to achieve similar results may work

  1. You can use pdftotext -layout some.pdf some.txt to create a file containing the text from your PDF. (This does not work reliably. Some PDFs, for example those which are missing a valid /ToUnicode table, will not lend themselves readily to text extraction.)

    This can lead you to the page number for a match.

    Using (with some trial'n'error) pdftotext -f 33 -l 33 -layout -x NN -y MM -W NN -H MM can narrow down the location of your match on page 33 more exactly.

    Using pdftotext -layout -bbox -f 33 -l 33 will return the coordinates of the bounding boxes for each word on page 33.

  2. You could use TET, the Text Extraction Toolkit to find the exact coordinates of matching words too. TET can give you the coordinates of individual glyphs even.

  3. Once you have identified the locations of your matches, you may be able to employ PDFlib to add your links.
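
The steps above can be sketched in Python as follows. This assumes the output of pdftotext -bbox is an XHTML document in which page elements (with width and height attributes) contain word elements (with xMin, yMin, xMax and yMax attributes), and that those coordinates are measured from the top-left corner of the page, so they are flipped here to PDF's bottom-left origin. The function name find_word_rectangles and its first_page_index parameter are hypothetical. Matching is done per word, so a regex that spans several words would not be found by this sketch.

```python
import re
import xml.etree.ElementTree as ET


def find_word_rectangles(bbox_xhtml, pattern, first_page_index=0):
    """Return {page_index: [[llX, llY, urX, urY], ...]} for words matching pattern.

    bbox_xhtml is the text (or bytes) written by `pdftotext -bbox`.
    """
    regex = re.compile(pattern)
    root = ET.fromstring(bbox_xhtml)
    hits = {}
    page_index = first_page_index - 1
    page_height = 0.0
    # root.iter() walks in document order, so each page is seen before its words.
    for element in root.iter():
        tag = element.tag.rsplit('}', 1)[-1]  # drop any XML namespace prefix
        if tag == 'page':
            page_index += 1
            page_height = float(element.get('height'))
        elif tag == 'word' and regex.search(element.text or ''):
            x_min, y_min, x_max, y_max = (
                float(element.get(key)) for key in ('xMin', 'yMin', 'xMax', 'yMax')
            )
            # Assumption: bbox y values grow downwards; PDF y values grow upwards.
            rectangle = [x_min, page_height - y_max, x_max, page_height - y_min]
            hits.setdefault(page_index, []).append(rectangle)
    return hits
```

The XHTML can be produced with something like pdftotext -bbox -f 33 -l 33 some.pdf page33.html and then read from that file in binary mode. When -f 33 is used, passing first_page_index=32 keeps the keys in line with zero-based page numbers.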

Kurt Pfeifle
  • Right. I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's `addLink()`, but that adds _internal_ links in the PDF, not links to external URLs. – Joe Mornin Mar 04 '15 at 16:24
  • Well, it can be done, but it is extremely complicated. One would need to implement quite a lot of code. – joojaa Mar 04 '15 at 16:39
  • @Kurt Yes, that's what I'm saying. See my question from 1.5 years ago: https://stackoverflow.com/questions/19414763/detect-and-alter-strings-in-pdfs I'll upvote you for your thorough response, but I'll leave the bounty open for now because it doesn't answer the question. If nobody comes up with a viable approach, I'll give you the bounty. – Joe Mornin Mar 04 '15 at 17:17
  • [My comment replied to a comment that was deleted, which said something like: "Are you saying that you knew that a PDF doesn't contain literal strings _before_ you read my answer?"] – Joe Mornin Mar 07 '15 at 18:20

I have solved this. Appreciate anyone cleaning up any errors. https://github.com/JohnMulligan/PyPDF2/tree/URI-linking

Because Kurt answered most of parts 1 and 2, I'm going to restrict my answer to part 3 of the original question: how to add external links to a PDF. (I have a fully working answer to 1 & 2, but it's inelegant. If people want it, I'll post that, too.)

My branch of PyPDF2 has addURI functionality that works in the same way as the package's original addLink().

Specifically, start with a rectangles dictionary keyed by page number:

rectangles_dictionary = {0:{'key1':[255, 156, 277, 171],'key2':[293, 446, 323, 461]},1:{'key2':[411, 404, 443, 419]}}

(Rectangle format is [llX, llY, urX, urY], i.e. the lower-left and upper-right corners.) This assigns two rectangles to page 1 (key 0) and one rectangle to page 2 (key 1).

Add a URLs dictionary that uses those keys to look up the URLs to assign:

destinations_dictionary = {'key1':'url1','key2':'url2'}

We can then add the appropriate links to all those rectangle zones:

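# Assumes reader/writer are PyPDF2's PdfFileReader/PdfFileWriter;
# addURI comes from the branch linked above.
from PyPDF2 import PdfFileReader, PdfFileWriter

def make_pdf(rectangles_dictionary, destinations_dictionary):
    result = PdfFileWriter()

    with open('pdfs/input_pdf.pdf', 'rb') as source:
        reader = PdfFileReader(source)

        # Copy every page of the input into the output document.
        for pagenum in range(reader.getNumPages()):
            result.addPage(reader.getPage(pagenum))

        for pagenum, named_rectangles in rectangles_dictionary.items():
            # Each name maps to a single [llX, llY, urX, urY] rectangle.
            for name, rectangle in named_rectangles.items():
                destination = destinations_dictionary[name]
                result.addURI(pagenum, destination, rectangle)

        # Write while the input file is still open.
        with open('pdfs/output_pdf.pdf', 'wb') as output:
            result.write(output)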

There are cleaner ways to build those dictionaries (with JSON or something similar), but for my implementation this was the most efficient way.

The key line of course is this one:

result.addURI(pagenum, destination, rectangle)

Here pagenum is an int, destination is a str, and rectangle is a list.
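
For background on what a call like addURI has to put into the file: as noted in the comments on the question, links in PDF are annotations, and an external link is a Link annotation whose action is a URI action. The sketch below only builds that annotation dictionary as PDF syntax text. It is not the code from the branch above, and the function name uri_link_annotation is made up for illustration.

```python
def uri_link_annotation(rectangle, url):
    """Return the PDF syntax for a Link annotation that opens an external URL.

    rectangle is [llX, llY, urX, urY] in page coordinates.
    """
    ll_x, ll_y, ur_x, ur_y = rectangle
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = url.replace('\\', '\\\\').replace('(', '\\(').replace(')', '\\)')
    return (
        '<< /Type /Annot /Subtype /Link '
        '/Rect [{} {} {} {}] /Border [0 0 0] '
        '/A << /S /URI /URI ({}) >> >>'
    ).format(ll_x, ll_y, ur_x, ur_y, escaped)


if __name__ == '__main__':
    print(uri_link_annotation([255, 156, 277, 171], 'http://www.match.com/'))
```

A library still has to add such an object to the page's /Annots array and write a valid file, which is the part a PDF library takes care of.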

Paul Roub
John M.
  • How do you find the rectangle values to use for a given text? – solsTiCe Oct 22 '17 at 16:31
  • I can't find my old code for that part. But Kurt's answer below points in the right direction. If I remember correctly how I did it, you should extract every single text character, recording their x/y coords and using the font data for height/width deltas. Then compile these into words with bounding boxes from those coords. Then you can search on the text you've built for matches, and return the bounding boxes. I think! – John M. Nov 01 '17 at 18:52
  • This question is old and closed, but for those looking for solutions to adding external links to PDF's, try [PyFPDF](https://pyfpdf.readthedocs.io/en/latest/index.html) in which the function [`fpdf.link`](https://pyfpdf.readthedocs.io/en/latest/reference/link/index.html) can link to external URL's. – Chris Collett Jan 15 '21 at 17:02

As PDF is a binary format, regexes are not the right approach to this problem. You need to use a Python PDF library that can read and write PDF files. A starting point could be this SO question.

Community
Lorenz Meyer