Extract link from string

Question

I am trying to find an efficient way to extract substrings from a file in Python. I have a file with extracted HTML code:

<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>

but I mostly fail at the extraction. My goal is another txt file containing only the clean links, such as "/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf", without the HTML syntax. That is, each link starts with /archiv and ends with .pdf.

I tried exploring the for-each approach and regex, but had no luck since I am a beginner. I would be grateful for any advice.

The way you do this is with an HTML parser like BeautifulSoup. You then look for all of the `<a>` tags and extract their `href` attributes. — Tim Roberts, Aug 16 '22 at 20:55
You can find the answer here: https://stackoverflow.com/a/73374424/17845381 Possibly a duplicate. — vovakirdan, Aug 17 '22 at 06:48
Note: **NEVER USE REGEX TO PARSE HTML/XML**. See https://stackoverflow.com/q/1732348/17845381 — vovakirdan, Aug 17 '22 at 06:50
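
A minimal sketch of the parser-based approach suggested in the first comment. It uses the standard library's `html.parser` rather than BeautifulSoup, purely so that nothing needs installing; the names `HrefCollector` and `extract_links` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


sample = '<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>'
print(extract_links(sample))
# prints ['/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf']
```

Because the parser looks at tags and attributes rather than raw characters, it should keep working if the attribute order or whitespace in the HTML changes.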

score 0 · Answer 1 · answered Aug 16 '22 at 21:00

Use the urllib.parse.urlparse function to parse the URL. Here's an example:

from urllib.parse import urlparse

url_str = 'https://example.com'
url_obj = urlparse(url_str)

if not (url_obj.scheme and url_obj.netloc): # validity check: a bare host such as example.com has an empty path
  print(f'The URL {url_str} is invalid!')
else:
  print(f'The URL {url_str} is valid!')

Michael Richo · Answer 2 · 2022-08-16T21:44:54.407

Using plain Python, we can do this easily without any libraries:

text = """
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
<td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
<td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>
"""

links = [line.split('<a href="')[1].split('"')[0] for line in text.split('\n') if '<a href="' in line]

print(links)

The output:

['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']

This splits the text on '\n' and, for each line that contains an href, takes the text between the quotes of the href attribute. The result is a list called 'links'.

To write the list to a file, with one link per line:

# the with block closes the file automatically, even if a write fails
with open('test.txt', 'w') as f:
    for link in links:
        f.write(link + '\n')

score 0 · Answer 3 · answered Aug 16 '22 at 23:21

re.findall can help:

import re

t = '''

<td><a href="/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf" target="_blank">Jitka Horáková</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf" target="_blank">Bohumil Tobolka</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf" target="_blank">Stanislava Rousová, Ing.</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf" target="_blank">Ladislav Macháč</a></td>
                                    <td><a href="/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf" target="_blank">Dana Macháčová</a></td>

'''

# non-greedy match from /archiv up to the first literal ".pdf"
print(re.findall(r'/archiv.*?\.pdf', t))

['/archiv/zivotopisy/2022/6/Zivotopis-OJVLA-20220624132548.pdf', '/archiv/zivotopisy/2022/6/Zivotopis-XUBIC.pdf', '/archiv/zivotopisy/2022/5/Zivotopis-UNBLA.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-NYBCF-20220407134152.pdf', '/archiv/zivotopisy/2022/4/Zivotopis-PVDPA.pdf']
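
Since the question asks for the links to end up in another txt file, here is a rough end-to-end sketch along the same lines. The function name `save_links` and the file names in the usage note are assumptions for illustration, and the pattern is anchored on the `href="..."` attribute, which assumes double-quoted attributes as in the sample.

```python
import re
from pathlib import Path

# Captures the value of an href attribute that starts with /archiv and ends with .pdf
HREF_PDF = re.compile(r'href="(/archiv[^"]*\.pdf)"')


def save_links(src, dst):
    """Read HTML text from src and write one matching link per line to dst."""
    html_text = Path(src).read_text(encoding="utf-8")
    links = HREF_PDF.findall(html_text)
    Path(dst).write_text("".join(link + "\n" for link in links), encoding="utf-8")
    return links
```

It could be called as `save_links("input.html", "links.txt")`, where both file names are placeholders. Using `[^"]*` instead of `.*?` keeps a match from running past the closing quote of the attribute.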
