
I am trying to scrape website data using BS4, but I can't write the right expression to grab the link I need. I want to get the link to the searched resource, which should look like this:

<a href="www.speed.org">Speed Org</a>

The code I have written to do this is:

r = re.compile(r'^<a(.)*speed.org(.)*</a>$')

I want the code to display:

<a href="www.speed.org">Speed Org</a>

But it is not giving the expected output. Can anyone please help me fix this code?

Edit:

Someone pointed out that the expression itself is wrong. The correct expression should be r'^<a(.*)speed.org(.*)</a>$'. Since I was already using BS4, though, it was easier to get the result using soup.

Thanks to all for help. :)

noobita

1 Answer


If you're already using BeautifulSoup, don't treat the HTML as a string. Let BeautifulSoup parse it and then use BeautifulSoup.find_all to search for your elements:

import re
from bs4 import BeautifulSoup

# your_html is the HTML text of the page you fetched
soup = BeautifulSoup(your_html, 'lxml')
# Raw string, so the backslashes reach the regex engine unchanged
links = soup.find_all('a', href=re.compile(r'www\.speed\.org'))

href=re.compile(r'www\.speed\.org') uses a regex to narrow the results down to the links whose href attribute matches it.
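To make that concrete, here is a minimal self-contained sketch of the same approach. The sample_html string is made up for illustration (your real HTML would come from the page you scraped), and it uses the built-in 'html.parser' rather than 'lxml' only so that it runs without an extra install.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample page; the real HTML would come from the scraped site.
sample_html = """
<html><body>
  <a href="www.speed.org">Speed Org</a>
  <a href="www.example.com">Something else</a>
</body></html>
"""

# 'html.parser' ships with Python; 'lxml' works too if it is installed.
soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only <a> tags whose href matches the pattern (dots escaped, raw string).
links = soup.find_all('a', href=re.compile(r'www\.speed\.org'))

for link in links:
    print(link)          # the whole tag, e.g. <a href="www.speed.org">Speed Org</a>
    print(link['href'])  # just the URL
```

Printing a tag gives you its full markup, which is what the question asks to display. If you only need the address, index the tag with ['href'].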

Blender