Extract string from PDF that contains a URL

Question

I have a PDF document with a few hyperlinks in it, and I need to extract the text that each URL is attached to. I have tried PyPDF2 and PyPDF4.

I am able to extract the URLs, but not the text that carries each link.

For example, the PDF contains the text "Check this link out" with a link attached to it. I am able to extract the link https://stackoverflow.com, but I also need "Check this link out".

import PyPDF4
import requests

# Open the PDF file
pdf_file = open('abc.pdf', 'rb')

# Create a PDF reader object
pdf_reader = PyPDF4.PdfFileReader(pdf_file)

# Loop through each page of the PDF
for page_num in range(len(pdf_reader.pages)):
    # Get the page object
    page = pdf_reader.pages[page_num]

    # Extract the annotations from the page
    annotations = page.get('/Annots')

    # If there are no annotations, skip to the next page
    if not annotations:
        continue

    # Loop through each annotation
    for annotation in annotations:
        # Get the annotation dictionary
        annotation_dict = annotation.getObject()

        # If the annotation is a link, extract the URL and its associated string
        if annotation_dict.get('/Subtype') == '/Link':
            url_dict = annotation_dict.get('/A')

            if url_dict is not None:
                url = url_dict.get('/URI')
                url_string = annotation_dict.get('/Contents')

                if url is not None:
                    # Check if the URL is working or broken
                    try:
                        response = requests.get(url)

                        if response.status_code == 200:
                            print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nWorking fine!")
                        else:
                            print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nBroken!")
                    except requests.exceptions.RequestException as e:
                        print(f"Page {page_num + 1}: URL - {url}\nString - {url_string}\nBroken! Error: {e}")

# Close the PDF file
pdf_file.close()

Currently, with the script above, I am getting the following result for the string:

String - None

I also tried all the code available here:

https://stackoverflow.com/a/49614726/6697295.

[Check this out](https://stackoverflow.com/a/49614726/6697295). It seems the text must be extracted by getting the text that overlaps the rect of the href — Elias, Jun 30 '23 at 13:17
@Elias Thank you for your response I tried the code available in the link you provided but it is not working. I can only get URLs but I want both URL and text. — Sandy, Jul 01 '23 at 14:04

K J · Answer 1 · 2023-07-10T17:06:44.380

The process is theoretically simple whichever application you use; the problem is finding the interrelationships.

First the contents need decoding to be searchable for the URI data, then that entry needs to be linked back to its location on the page surface, in this case two words, then one. But how do we know it is that location? The URI annotation does not name its page; it is page-less. So we backtrack to where the first URI, /Annots[46 0 R, is included, which is in this page, 42 0 obj.

46 0 obj
<</A<</S/URI/URI(https://www.abc.com.html#C_Concept.dita_da0d66f6-405e-4db6-8845-1578e253f5bd)>>/Subtype/Link/Border[0 0 0]/Rect[132.6 305.141 187.28 314.233]>>
endobj

42 0 obj
<</Contents[43 0 R 44 0 R]/BleedBox[0 0 612 792]/Type/Page/Resources 45 0 R/Parent 7 0 R/CropBox[0 0 612 792]/Annots[46 0 R 47 0 R 48 0 R 49 0 R 50 0 R]/MediaBox[0 0 612 792]/TrimBox[0 0 612 792]>>
endobj

Likewise, that page is listed as the second entry of /Kids[3 0 R 42 0 R among the pages:

7 0 obj
<</Kids[3 0 R 42 0 R 51 0 R 69 0 R 84 0 R 104 0 R 111 0 R 116 0 R 130 0 R 146 0 R]/Type/Pages/Count 10/Parent 6 0 R>>
endobj

So now we know we are looking for those words on page 2 at that location. And to avoid repeating the whole search (using human intuition), if the next URI's rect is slightly lower, it must belong to the next line of text down.

Thus:

47 0 obj
<</A<</S/URI/URI(https://www.abc.html#C_Concept.dita_da0d66f6-405e-4db6-8845-1578e253f5bd)>>/Subtype/Link/Border[0 0 0]/Rect[132.6 293.141 190.85 302.233]>>
endobj

should be Enhancements.
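
The backtracking described above (annotation, then the page whose /Annots lists it, then that page's position in /Kids) can be sketched in Python. This is only a toy model: the `objects` dictionary and the `page_number_of` helper are hypothetical names, the dictionary holds just the entries quoted above, and it assumes the /Kids array lists pages directly rather than nested /Pages nodes.

```python
# Toy model of the PDF objects quoted above, keyed by object number.
# Only the entries needed for the lookup are kept (assumption: flat page tree).
objects = {
    7: {"Type": "/Pages", "Kids": [3, 42, 51, 69, 84, 104, 111, 116, 130, 146]},
    42: {"Type": "/Page", "Annots": [46, 47, 48, 49, 50]},
    46: {"Subtype": "/Link", "Rect": [132.6, 305.141, 187.28, 314.233]},
    47: {"Subtype": "/Link", "Rect": [132.6, 293.141, 190.85, 302.233]},
}


def page_number_of(annot_id, objects, pages_id):
    """Return the 1-based position in /Kids of the page that lists annot_id."""
    kids = objects[pages_id]["Kids"]
    for index, page_id in enumerate(kids, start=1):
        # Pages missing from the toy model are treated as having no annotations.
        page = objects.get(page_id, {})
        if annot_id in page.get("Annots", []):
            return index
    return None


print(page_number_of(46, objects, pages_id=7))  # 2
print(page_number_of(47, objects, pages_id=7))  # 2
```

A real reader library resolves these references for you; the sketch only shows why the annotation by itself cannot tell you its page.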

So whether it is C#, CMD, JS, VBA, Python or PyPDF, the loops are just the same as a human would follow.

The greater remaining challenge is defining a copy-and-paste function at that location on page 2, and that is where PyMuPDF may perhaps have better rect handling. However, beware: the Y units there may need reversing, which raises yet another challenge.

As an example, here is that area as reported by a MuTool "trace" of page 2. We see X=132.6; however, if we searched for Y=305.141 we would miss Y=307.299. Thus aligning Y is a question of setting a known tolerance/range, for example 290 to 310.
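
As a rough illustration of the Y reversal: a raw /Rect is measured upward from the bottom of the page, while some tools report Y downward from the top. The hypothetical `flip_rect` helper below converts between the two, assuming the page box starts at 0 0 (as in the MediaBox quoted above) and the page is not rotated.

```python
def flip_rect(rect, page_height):
    """Convert [x0, y0, x1, y1] between bottom-left and top-left origins."""
    x0, y0, x1, y1 = rect
    # The old top edge becomes the new smaller Y value, and vice versa.
    return [x0, page_height - y1, x1, page_height - y0]


link_rect = [132.6, 305.141, 187.28, 314.233]  # /Rect of 46 0 obj
page_height = 792  # from /MediaBox[0 0 612 792]

flipped = flip_rect(link_rect, page_height)
print([round(value, 3) for value in flipped])
```

Applying the function twice returns the original rect (up to floating-point rounding), so the same helper works in both directions.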

        <span font="GFEDCB+TimesNewRomanPSMT" wmode="0" bidi="0" trm="10 0 0 10">
            <g unicode="A" glyph="A" x="132.6" y="307.299" adv=".72216799"/>
            <g unicode="S" glyph="S" x="139.82" y="307.299" adv=".55615237"/>
            <g unicode="E" glyph="E" x="145.38" y="307.299" adv=".61083987"/>
            <g unicode="L" glyph="L" x="153.98" y="307.299" adv=".61083987"/>
            <g unicode="o" glyph="o" x="160.09" y="307.299" adv=".5"/>
            <g unicode="a" glyph="a" x="165.09" y="307.299" adv=".44384767"/>
            <g unicode="d" glyph="d" x="169.53" y="307.299" adv=".5"/>
            <g unicode="i" glyph="i" x="174.53" y="307.299" adv=".27783204"/>
            <g unicode="n" glyph="n" x="177.31" y="307.299" adv=".5"/>
            <g unicode="g" glyph="g" x="182.31" y="307.299" adv=".5"/>
            <g unicode="E" glyph="E" x="132.6" y="295.299" adv=".61083987"/>
            <g unicode="n" glyph="n" x="138.71" y="295.299" adv=".5"/>
            <g unicode="h" glyph="h" x="143.71" y="295.299" adv=".5"/>
            <g unicode="a" glyph="a" x="148.71" y="295.299" adv=".44384767"/>
            <g unicode="n" glyph="n" x="153.15001" y="295.299" adv=".5"/>
            <g unicode="c" glyph="c" x="158.15001" y="295.299" adv=".44384767"/>
            <g unicode="e" glyph="e" x="162.59001" y="295.299" adv=".44384767"/>
            <g unicode="m" glyph="m" x="167.03002" y="295.299" adv=".77783206"/>
            <g unicode="e" glyph="e" x="174.81002" y="295.299" adv=".44384767"/>
            <g unicode="n" glyph="n" x="179.25002" y="295.299" adv=".5"/>
            <g unicode="t" glyph="t" x="184.25002" y="295.299" adv=".27783204"/>
            <g unicode="s" glyph="s" x="187.03002" y="295.299" adv=".38916017"/>
        </span>

So we could use a command line to read and display those values. This is not the easiest route; it just shows it is possible. However, there are simpler ways you can program this directly with Python libraries.

And thus we come to the simplest answer of all, which is to search an HTML reproduction of the page(s) and tidy up the result by splitting off the unwanted head and tail, then replacing the &#160; entities with space characters.
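
To make the tolerance idea concrete, here is a minimal Python sketch that matches the traced glyph positions against a link's /Rect. The `GLYPHS` list is copied by hand from the trace above, and `text_in_rect` with its `tol` parameter is a hypothetical helper. One simple way to set the Y range is to use the rect's own bottom and top edges, widened by a small tolerance.

```python
# (character, x, y) for the glyphs in the trace above.
GLYPHS = [
    (ch, x, 307.299)
    for ch, x in zip(
        "ASELoading",
        [132.6, 139.82, 145.38, 153.98, 160.09, 165.09, 169.53, 174.53, 177.31, 182.31],
    )
] + [
    (ch, x, 295.299)
    for ch, x in zip(
        "Enhancements",
        [132.6, 138.71, 143.71, 148.71, 153.15001, 158.15001, 162.59001,
         167.03002, 174.81002, 179.25002, 184.25002, 187.03002],
    )
]


def text_in_rect(glyphs, rect, tol=0.5):
    """Join the glyphs whose origin lies inside rect, widened by tol."""
    x0, y0, x1, y1 = rect
    hits = [
        (x, ch)
        for ch, x, y in glyphs
        if x0 - tol <= x <= x1 + tol and y0 - tol <= y <= y1 + tol
    ]
    # Sort left to right so the characters come out in reading order.
    return "".join(ch for _, ch in sorted(hits))


print(text_in_rect(GLYPHS, [132.6, 305.141, 187.28, 314.233]))  # ASELoading
print(text_in_rect(GLYPHS, [132.6, 293.141, 190.85, 302.233]))  # Enhancements
```

The trace has no glyph for the space in "ASE Loading", so the sketch returns it without one; recovering the space would mean comparing the gap between neighbouring glyphs with their advance widths.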

\poppler\bin>pdftohtml -noframes -hidden abc.pdf 1>nul 2>nul

\poppler\bin>type abc.html |find /i "https:"|more
<a href="https://www.abc.html#C_Concept.dita_da0d66f6-405e-4db6-8845-1578e253f5bd">ASE&#160;Loading</a><br/>
<a href="https://www.abc.html#C_Concept.dita_da0d66f6-405e-4db6-8845-1578e253f5bd">Enhancements</a><br/>
<a href="https://www.abc.html#C_Concept.dita_ff4df4be-c3c5-494c-9d84-3d5ed1409863">PROD&#160;2134&#160;L-Fan</a><br/>
<a href="https://www.abc.html#C_Concept.dita_ff4df4be-c3c5-494c-9d84-3d5ed1409863">and&#160;C+L-Fan</a><br/>
<a href="https://www.abc.html#C_Concept.dita_ff4df4be-c3c5-494c-9d84-3d5ed1409863">Support.</a><br/>
<a href="https://www.abc.html#omnidirectional--support1">LAT&#160;Based</a><br/>
<a href="https://www.abc.html#omnidirectional--support1">Omnidirectional</a><br/>
<a href="https://www.abc.html#omnidirectional--support1">Add/Drop&#160;Topology&#160;</a>omnidirectional&#160;add/drop&#160;can&#160;be&#160;implemented&#160;on&#160;a&#160;node&#160;to&#160;provide&#160;redundancy&#160;or<br/>
<a href="https://www.abc.html">SNMP&#160;MIB&#160;support&#160;</a>The&#160;standard&#160;optical&#160;Simple&#160;Network&#160;Management&#160;Protocol&#160;(SNMP)&#160;Management<br/><a href="https://www.abc.html">is&#160;enabled&#160;in&#160;PROD</a><br/>
<a href="https://www.abc.html">2134</a><br/>
PROD&#160;2134&#160;platform.&#160;Also,&#160;from&#160;R7.9.1&#160;Release&#160;onwards&#160;specific&#160;support&#160;is&#160;enabled<br/>for&#160;the&#160;OTS&#160;SNMP&#160;MIB.&#160;See&#160;<a href="https://cfnng.abc.com/mibs">C&#160;SNMP&#160;MIBs&#160;</a>for&#160;details.<br/>
<a href="https://www.abc.html">APC&#160;enhancements</a><br/>
<a href="https://www.abc.html">Fan&#160;Failure</a><br/>
<a href="https://www.abc.html">Reco&#160;(BFR)</a><br/>
<a href="https://www.abc.html#c-ILT-l">PRODK-ILT-L&#160;Line&#160;</a>The&#160;new&#160;PRODK-ILT-L&#160;line&#160;card&#160;for&#160;the&#160;PROD&#160;2134&#160;optical&#160;line&#160;system&#160;amplifies<br/><a href="https://www.abc.html#c-ILT-l">Card</a><br/>
<a href="https://www.abc.html#c-LAT-l">PRODK-LAT-L&#160;Line&#160;</a>The&#160;new&#160;PRODK-LAT-L&#160;line&#160;card&#160;for&#160;the&#160;PROD&#160;2134&#160;optical&#160;line&#160;system&#160;performs&#160;the<br/><a href="https://www.abc.html#c-LAT-l">Card</a><br/>
<a href="https://www.abc.html#C_Concept.dita_782e3652-dcc9-49f9-8181-9f2f30a46ae4">Port&#160;Status&#160;for</a><br/>
<a href="https://www.abc.html#C_Concept.dita_782e3652-dcc9-49f9-8181-9f2f30a46ae4">Breakout&#160;Modules</a><br/>
<a href="https://www.abc.html#C_Concept.dita_dce6c2f9-fc32-43e0-aec5-abc2d4426347">Port&#160;Status&#160;for</a><br/>
<a href="https://www.abc.html#C_Concept.dita_dce6c2f9-fc32-43e0-aec5-abc2d4426347">Mux/Demux&#160;Patch</a><br/>
<a href="https://www.abc.html#C_Concept.dita_dce6c2f9-fc32-43e0-aec5-abc2d4426347">Panel</a><br/>
-- More  --
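
If you would rather do the tidying in Python than with find and more, the same HTML can be parsed with the standard library. This is a sketch, not part of the original command-line answer: the `LinkText` class is a hypothetical helper, and `sample` holds two lines taken from the output above.

```python
from html.parser import HTMLParser


class LinkText(HTMLParser):
    """Collect (href, visible text) pairs for every <a href=...> element."""

    def __init__(self):
        # convert_charrefs=True turns &#160; into the character U+00A0.
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Replace non-breaking spaces and trim the trailing one.
            text = "".join(self._parts).replace("\xa0", " ").strip()
            self.links.append((self._href, text))
            self._href = None


sample = (
    '<a href="https://www.abc.html#omnidirectional--support1">LAT&#160;Based</a><br/>\n'
    'See&#160;<a href="https://cfnng.abc.com/mibs">C&#160;SNMP&#160;MIBs&#160;</a>'
    "for&#160;details.<br/>\n"
)

parser = LinkText()
parser.feed(sample)
parser.close()
for href, text in parser.links:
    print(href, "->", text)
```

For the real file you would feed the parser the contents of abc.html instead of `sample`. Link text that wraps over several lines appears as consecutive entries with the same href, so those could be joined afterwards if you want one string per link.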
