
I have links stored in a Python list, but I need to extract only the PDF links.

    links = [ '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>', '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-link-4ea4-8f1c-dd36a1f55d6f.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>']

So I need to extract only the link starting with 'https' and ending with 'pdf', as given below

    https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf

And store this link in a list. There are many PDF links in the variable 'links', and I need to store all of them in a variable named 'pdf_links'.

Can anyone suggest a regular expression to extract these PDF links? I have used the regular expression below, but it's not working.

    pdf_regex = r""" (^<a\sclass="tablebluelink"\shref="(.)+.pdf"$)"""

3 Answers


Everybody will tell you that it's wrong to process HTML using regex. Instead of showing you how it could be done that way anyway, I'd like to show you how easy it actually is to parse HTML with a library, e.g. BeautifulSoup 4, which is often recommended.

To keep it simple and close to your sample code, I just join your input list into a single string. Usually, you would feed the raw HTML directly to the parser.

    from bs4 import BeautifulSoup

    links = [ '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-b4df-16t9g8p93808.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>', '<a class="tablebluelink" href="https://www.samplewebsite.com/xml-data/abcdef/higjkl/Thisisthe-required-document-link-4ea4-8f1c-dd36a1f55d6f.pdf" target="_blank"><img alt="Download PDF" border="0" src="../Include/images/pdf.png"/></a>']

    soup = BeautifulSoup(''.join(links), 'lxml')
    for link in soup.find_all('a', href=True):
        if link['href'].lower().endswith(".pdf"):
            print(link['href'])

Easy and straightforward, isn't it?
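
If you want the URLs collected in a pdf_links list (as in your question) rather than printed, a minimal sketch over the same soup would be:

    # collect every anchor's href that ends in .pdf
    pdf_links = [a['href'] for a in soup.find_all('a', href=True)
                 if a['href'].lower().endswith('.pdf')]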

wp78de

As Daniel Lee pointed out, regular expressions are not suitable for parsing HTML. However, as long as your HTML follows certain patterns for all cases, something like this should do the trick (obviously, just in a sandbox environment):

    import re

    # wrap the whole expression in list() so pdf_links is an actual list (in Python 3,
    # map() and filter() return lazy iterators)
    pdf_links = list(map(lambda m: m.group(1),
                         filter(lambda m: m is not None,
                                map(lambda link: re.search(r'.*href=\"([^\"]+\.pdf)\".*',
                                                           link, re.IGNORECASE),
                                    links))))
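
If the nested map/filter calls are hard to read, the same extraction can be sketched as a list comprehension with an equivalent pattern:

    # search each snippet for the href value and keep only the successful matches
    pdf_links = [m.group(1) for m in
                 (re.search(r'href="([^"]+\.pdf)"', link, re.IGNORECASE)
                  for link in links)
                 if m is not None]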

Firstly, you should NEVER parse HTML with regex.

"Parsing html with regex is like asking a beginner to write an operating system"

This answer is famous and forever relevant: RegEx match open tags except XHTML self-contained tags

It's probably worthwhile to take an hour and learn how matching groups work in regex. But this may help:

Secondly, links is a list, which means you either need to loop through it or (in this case) take the first element.

Try:

    import re

    # a fixed-up version of your pattern that captures just the .pdf URL
    pdf_regex = r'<a\sclass="tablebluelink"\shref="([^"]+\.pdf)"'
    r = re.match(pdf_regex, links[0])
    if r:
        print(r.group(1))
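
And if you want every PDF link rather than just the first, a small loop over the list (using the same pattern) collects them:

    pdf_links = []
    for link in links:
        r = re.match(pdf_regex, link)
        if r:
            pdf_links.append(r.group(1))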
Daniel Lee
  • Shouldn't this be done using `bs4` instead of a `regex`? – Austin Jun 20 '18 at 06:07
  • Absolutely. That's not the question though. And that regex should work for that example. – Daniel Lee Jun 20 '18 at 06:09
  • I used Beautiful Soup to extract all the links from the web page, but it's not helpful for extracting only the PDF file links. If you can suggest a way using bs4 to extract the PDF links directly, I am open to suggestions... – IamBatman Jun 20 '18 at 06:46