Find all links in HTML parsed with Beautiful Soup

Question

I am using BeautifulSoup with Python. In the pages I am scraping, the links are not always enclosed in <a href> tags.

I want to get all links starting with http/https using a soup operation. I have tried some of the regexes given here, but they give unexpected results for me, so I wondered whether this is possible using soup.

Example response from which I want to get links:

<html>\n<head>\n</head>\n<link href="https://fonts.googleapis.com/css?family=Open+Sans:600" rel="stylesheet"/>\n<style>\n    html, body {\n    height: 100%;\n    width: 100%;\n    }\n\n    body {\n    background: #F5F6F8;\n    font-size: 16px;\n    font-family: \'Open Sans\', sans-serif;\n    color: #2C3E51;\n    }\n    .main {\n    display: flex;\n    align-items: center;\n    justify-content: center;\n    height: 100vh;\n    }\n    .main > div > div,\n    .main > div > span {\n    text-align: center;\n    }\n    .main span {\n    display: block;\n    padding: 80px 0 170px;\n    font-size: 3rem;\n    }\n    .main .app img {\n    width: 400px;\n    }\n  </style>\n<script type="text/javascript">\n      var fallback_url = "null";\n      var store_link = "itms-apps://itunes.apple.com/GB/app/id1032680895?ls=1&mt=8";\n      var web_store_link = "https://itunes.apple.com/GB/app/id1032680895?mt=8";\n      var loc = window.location;\n      function redirect_to_web_store(loc) {\n        loc.href = web_store_link;\n      }\n      function redirect(loc) {\n        loc.href = store_link;\n        if (fallback_url.startsWith("http")) {\n          setTimeout(function() {\n            loc.href = fallback_url;\n          },5000);\n        }\n      }\n  </script>\n<body onload="redirect(loc)">\n<div class="main">\n<div class="workarea">\n<div class="logo">\n<img onclick="redirect_to_web_store(loc)" src="https://cdnappicons.appsflyer.com/app|id1032680895.png" style="width:200px;height:200px;border-radius:20px;"/>\n</div>\n<span>BetBull: Sports Betting &amp; Tips</span>\n<div class="app">\n<img onclick="redirect_to_web_store(loc)" src="https://cdn.appsflyer.com/af-statics/images/rta/app_store_badge.png"/>\n</div>\n</div>\n</div>\n</body>\n</html>

Tried:

import re
from bs4 import BeautifulSoup

regex_pattern_to_find_all_links = r'(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+'
soup = BeautifulSoup(resp.read(), 'html.parser')
urls = re.findall(regex_pattern_to_find_all_links, str(soup))

Result:

['https://fonts.googleapis.com/css?family=Open', '//itunes.apple.com/GB/app/id1032680895?ls=1', 'https://itunes.apple.com/GB/app/id1032680895?mt=8', 'window.location', 'loc.href', 'loc.href', 'fallback_url.startsWith', 'loc.href', 'https://cdnappicons.appsflyer.com/app', 'id1032680895.png', 'https://cdn.appsflyer.com/af-statics/images/rta/app_store_badge.png']

As you can see above, the regex is matching things that are not even URLs, and I am not sure why.

I have also tried the most upvoted and accepted answer here, but it does not detect the links at all. I am not sure what I am doing wrong.

As the first part use this instead `(?:(?:https?|ftp):\/\/|\bwww\.)` — revo, May 24 '18 at 12:04
It seems it worked for you so I posted it as an answer below. — revo, May 24 '18 at 12:15
Yes, now I am trying to ignore links that end in jpg, png, etc. — Kishan Mehta, May 24 '18 at 12:32

score 1 · Accepted Answer · answered May 24 '18 at 12:14

The problem is that you made the protocol optional, so the engine isn't forced to match it as long as the rest of the pattern is satisfied. That is why strings such as window.location and loc.href are matched. Try this instead:

(?:(?:https?|ftp):\/\/|\bwww\.)[^\s"']+

Not bulletproof, but much better. It matches strings that start with http://, https:// or ftp://, or that have no protocol but start with www.

See live demo here
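
For illustration, here is a minimal sketch of how the pattern above might be applied in Python. The helper name find_links, the skip_images flag, and the IMAGE_SUFFIXES tuple are assumptions made for this example (the filter is prompted by the follow-up comment about ignoring jpg/png links), and the sample markup is a shortened version of the response in the question.

```python
import re

# Pattern from the answer: require a scheme (http, https, ftp) or a leading "www."
URL_PATTERN = re.compile(r'''(?:(?:https?|ftp)://|\bwww\.)[^\s"']+''')

# Hypothetical list of suffixes to skip; extend it as needed.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif")


def find_links(html, skip_images=False):
    """Return URL-like strings found anywhere in the markup, in document order."""
    links = URL_PATTERN.findall(html)
    if skip_images:
        # Drop links whose text ends with one of the image suffixes.
        links = [u for u in links if not u.lower().endswith(IMAGE_SUFFIXES)]
    return links


# Shortened version of the response shown in the question.
sample = '''<link href="https://fonts.googleapis.com/css?family=Open+Sans:600" rel="stylesheet"/>
<script>
  var store_link = "itms-apps://itunes.apple.com/GB/app/id1032680895?ls=1&mt=8";
  var web_store_link = "https://itunes.apple.com/GB/app/id1032680895?mt=8";
  var loc = window.location;
</script>
<img src="https://cdn.appsflyer.com/af-statics/images/rta/app_store_badge.png"/>'''

print(find_links(sample))
print(find_links(sample, skip_images=True))
```

Because the scheme is now mandatory, window.location and the itms-apps:// link are not matched. Note that the suffix check is deliberately simple: an image URL followed by a query string (for example .png?v=2) would not be filtered out.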
