I have several PDFs that were generated with Microsoft Word. I want to:
- Use a regex to find matches in the PDF text.
- Convert the matching text to a link that points to an external URL.
- Save the new version of the PDF.
If I were doing this in HTML, it would look like this:
<!-- before: -->
This is the text to match.
<!-- after: -->
This is the text to <a href="http://www.match.com/" target="_blank">match</a>.
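For comparison, the HTML version of this transformation is easy to script. Here is a minimal sketch using Python's `re` module; the `linkify` helper name and the `\bmatch\b` pattern are just illustrations I made up for this example, not part of any library.

```python
import re


def linkify(text, pattern, url):
    """Wrap every regex match in an anchor tag that points at url."""
    def wrap(match):
        # A function replacement avoids backslash-escape surprises in the URL.
        return '<a href="{}" target="_blank">{}</a>'.format(url, match.group(0))

    return re.sub(pattern, wrap, text)


before = "This is the text to match."
after = linkify(before, r"\bmatch\b", "http://www.match.com/")
print(after)
# This is the text to <a href="http://www.match.com/" target="_blank">match</a>.
```

The PDF case is harder because there is no markup to wrap around the matched text, only positioned glyphs on a page.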
How can I do this to a PDF?
I'd prefer Python, but I'm open to alternatives.
Edit: I don't have access to the original Word documents. I need to manipulate the PDFs themselves. I'm looking for a technique using a Python PDF library (or something similar in another language).
Edit 2: I understand that the source code of a PDF doesn't contain literal strings. I'm wondering if there's an approach that could do something like: (1) extract the text, (2) find matches, and (3) for each match, draw a clickable box around the position of the text in the original PDF. The closest I've come is PyPDF2's addLink(), but that adds internal links in the PDF, not links to external URLs.
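To make the idea in Edit 2 more concrete, here is a rough sketch of steps (2) and (3) as I imagine them, in plain Python with no PDF library involved. Everything here is an assumption for illustration: `Word` and `find_link_boxes` are hypothetical names, the coordinates are made-up values standing in for whatever a library's text extraction would return, and narrowing the box by assuming equal-width glyphs is only an approximation.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    """One extracted word plus its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def find_link_boxes(words, pattern, url):
    """Return (rect, url) pairs for every regex match inside each word."""
    regex = re.compile(pattern)
    links = []
    for w in words:
        if not w.text:
            continue
        # Approximation: treat every glyph in the word as equally wide.
        per_char = (w.x1 - w.x0) / len(w.text)
        for m in regex.finditer(w.text):
            rect = (w.x0 + m.start() * per_char, w.y0,
                    w.x0 + m.end() * per_char, w.y1)
            links.append((rect, url))
    return links


# Made-up boxes for the tail of "This is the text to match."
words = [
    Word("text", 10, 700, 40, 712),
    Word("to", 45, 700, 55, 712),
    Word("match.", 100, 700, 160, 712),
]
boxes = find_link_boxes(words, r"match", "http://www.match.com/")
print(boxes)
```

Each rectangle would then need to be written to the page as a link annotation with an external URI target, and that last step is the part I don't know how to do. This sketch also only handles matches that fall within a single word; a match spanning several words would need the boxes merged.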