How to match the strings with special characters using regex

Question

I am trying to scrape website data using BS4, but I can't write the exact statement to grab the link I need. I want to get the link to the searched resource, which appears in the page as:

<a href="www.speed.org">Speed Org</a>

The code I have written to do this is:

r = re.compile(r'^<a(.)*speed.org(.)*</a>$')

I want the code to display:

<a href="www.speed.org">Speed Org</a>

But it is not giving the proper output. Can anyone please fix this code?

Edit:

Someone pointed out that the expression itself is wrong. The correct expression should be r'^<a(.*)speed.org(.*)</a>$'. Since I was already using BS4, it was easier to get the result using soup.
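For anyone comparing the two expressions, here is a minimal sketch of the difference between (.)* and (.*), run against the sample tag from above. The variable names are only illustrative, and the dots in speed.org are escaped here so that they match a literal dot, which the expressions above do not do.

```python
import re

html = '<a href="www.speed.org">Speed Org</a>'

# (.)* repeats a one-character group, so each group keeps only the
# last character it matched.
repeated_group = re.match(r'^<a(.)*speed\.org(.)*</a>$', html)

# (.*) puts the repetition inside the group, so each group keeps the
# whole run of characters.
group_of_repeats = re.match(r'^<a(.*)speed\.org(.*)</a>$', html)

print(repeated_group.groups())    # ('.', 'g')
print(group_of_repeats.groups())  # (' href="www.', '">Speed Org')
```

Both patterns match the whole tag, so group(0) is the same in each case; only the captured groups differ.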

Thanks to all for help. :)

Don't use regex to parse HTML. Cthulhu will eat your kittens and Zalgo will come for you. — tripleee, Jan 24 '18 at 18:04
Are you trying to grab the entire tag or just the href value of the tag? — Ryan Wilson, Jan 24 '18 at 18:05
@RyanWilson Only the value of href. If there is a better way to do it, kindly suggest. — noobita, Jan 24 '18 at 18:06
Why is there a parenthesis around `(.)*` and which character do you expect to end up being captured? (Hint: In Python it will contain the last matched character from the repetition.) — tripleee, Jan 24 '18 at 18:06
I understood the (.*) and (.), thanks for that. :D @RyanWilson Thanks a lot. It works for me. :) — noobita, Jan 24 '18 at 18:10
[H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ctwheels, Jan 24 '18 at 18:16

score 2 · Accepted Answer · answered Jan 24 '18 at 18:10

If you're already using BeautifulSoup, don't treat the HTML as a string. Let BeautifulSoup parse it and then use BeautifulSoup.find_all to search for your elements:

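import re
from bs4 import BeautifulSoup

# your_html is the HTML text of the page you fetched
soup = BeautifulSoup(your_html, 'lxml')
# Raw string so the backslashes reach the regex engine unchanged
links = soup.find_all('a', href=re.compile(r'www\.speed\.org'))

href=re.compile(r'www\.speed\.org') uses a regex to narrow the links down to those whose href attribute contains a match for it. The dots are escaped because an unescaped . matches any character.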
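Since the goal in the comments was only the href value, here is a small self-contained sketch of the same approach. The sample_html string is made up for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' only so that it runs without an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content, standing in for the HTML you scraped.
sample_html = '''
<p><a href="www.speed.org">Speed Org</a></p>
<p><a href="www.example.com">Example</a></p>
'''

soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the <a> tags whose href contains a match for the pattern.
links = soup.find_all('a', href=re.compile(r'www\.speed\.org'))

# Each result is a Tag; indexing it by attribute name gives the value.
hrefs = [link['href'] for link in links]

print(links)  # [<a href="www.speed.org">Speed Org</a>]
print(hrefs)  # ['www.speed.org']
```

Printing a Tag gives back its markup, so links[0] shows the full <a> element the question asked for, while link['href'] gives just the URL.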
