0

I am trying to create a regex that matches the link from a page source. I have text formatted like this:

something here here's a link

<a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a><span class="mls _1ccm9 _49"></span><a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a><span class="m1ls _1cm9 _49"></span>

I want to get all the links that start with href="https://www.here.com/(.*)?ref=br_rs">

So from the links about, I would get either the entire link, or FIrstCal and Mall (from the link)

Python code:

regex = r'(?<=href="https://www.here.com/).*(?<=?ref=br_rs)'

link = re.findall(regex, str(source))

link

But it's not working.

Any ideas ?

PS: Regex would be the only way to do this. A html parse won't work because the website is not "stable" with it's structure.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
icebox19
  • 493
  • 3
  • 6
  • 15
  • 1
    Obligatory reference: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348) – Martijn Pieters Feb 26 '14 at 13:07
  • 1
    Don't go there, use a HTML parser instead and make your life that much easier. Regular expressions are not the best tool for HTML parsing. – Martijn Pieters Feb 26 '14 at 13:07
  • I know, but I am trying to scrape a website, that changes it's div ids,classes very often, so the only way that I could go is regex. I would like a html parser, but I can't here – icebox19 Feb 26 '14 at 13:08
  • BeautifulSoup can handle this case easily, by applying regular expressions to the attribute values only. `soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))` for example. – Martijn Pieters Feb 26 '14 at 13:09

1 Answers1

3

Use BeautifulSoup with a regular expression matching just the href contents:

soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))

The parser won't care that the structure is changing, you just need to be precise about what is stable; the links.

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a><span class="mls _1ccm9 _49"></span><a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a><span class="m1ls _1cm9 _49"></span>
... ''')
>>> soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))
[<a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a>, <a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a>]
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343