How to match all links in python regex?

Question

I am trying to create a regex that matches the link from a page source. I have text formatted like this:

something here here's a link

<a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a><span class="mls _1ccm9 _49"></span><a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a><span class="m1ls _1cm9 _49"></span>

I want to get all the links that start with href="https://www.here.com/(.*)?ref=br_rs">

So from the links above, I would get either the entire link, or FirstCal and Mall (from the link).

Python code:

regex = r'(?<=href="https://www.here.com/).*(?<=?ref=br_rs)'

link = re.findall(regex, str(source))

link

But it's not working.

Any ideas?

PS: Regex would be the only way to do this. An HTML parser won't work because the website's structure is not stable.

Obligatory reference: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/q/1732348) – Martijn Pieters, Feb 26 '14 at 13:07
Don't go there; use an HTML parser instead and make your life that much easier. Regular expressions are not the best tool for HTML parsing. – Martijn Pieters, Feb 26 '14 at 13:07
I know, but I am trying to scrape a website that changes its div ids and classes very often, so the only way I could go is regex. I would like to use an HTML parser, but I can't here. – icebox19, Feb 26 '14 at 13:08
BeautifulSoup can handle this case easily, by applying regular expressions to the attribute values only. `soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))` for example. – Martijn Pieters, Feb 26 '14 at 13:09

score 3 · Accepted Answer · answered Feb 26 '14 at 13:10

Use BeautifulSoup with a regular expression matching just the href contents:

soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))

The parser doesn't care that the structure keeps changing; you just need to be precise about what is stable: the links.

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''\
... <a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a><span class="mls _1ccm9 _49"></span><a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a><span class="m1ls _1cm9 _49"></span>
... ''')
>>> soup.find_all('a', href=re.compile('https://www.here.com/.*?ref=br_rs'))
[<a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a>, <a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a>]
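To get just `FirstCal` and `Mall` rather than the whole tag, one option is a capturing group applied to each href value. This is a minimal sketch, not part of the original answer: the name `LINK_RE` is made up for illustration, and `hrefs` is assumed to hold the href strings of the matched tags (for example `[a['href'] for a in soup.find_all(...)]`). The dots and the `?` are escaped here so they match literally; in the pattern above, `.*?` is read as a lazy quantifier, which still happens to match these links. The pattern in the question fails for a related reason: the unescaped `?` at the start of `(?<=?ref=br_rs)` has nothing to repeat, so the regex does not compile.

```python
import re

# Hypothetical name. The dots and the "?" are escaped so they match literally,
# and the group captures the path segment between the host and the query string.
LINK_RE = re.compile(r'https://www\.here\.com/([^"?]+)\?ref=br_rs')

# Assumed input: the href values of the two links from the demo above.
hrefs = [
    "https://www.here.com/FirstCal?ref=br_rs",
    "https://www.here.com/Mall?ref=br_rs",
]

names = []
for href in hrefs:
    match = LINK_RE.fullmatch(href)
    if match:
        # group(1) is the captured segment; group(0) would be the entire link.
        names.append(match.group(1))

print(names)  # ['FirstCal', 'Mall']
```

Using `fullmatch` rejects hrefs that only contain the pattern somewhere in the middle; switch to `search` if trailing query parameters are expected.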

Martijn Pieters

I cannot install BeautifulSoup. I am trying to install it with pip3.3 for python 3.3, but I get an error that unit tests have failed. – icebox19 Feb 26 '14 at 13:13
What's the error? Can you copy paste it for us? Thanks! :) – Ryan O'Donnell Feb 26 '14 at 13:15
@icebox19: what is the error there? I have had no problems installing BeautifulSoup on Python 3.3 with pip. – Martijn Pieters Feb 26 '14 at 13:16
@icebox19: you are installing BeautifulSoup 3, which is not compatible with Python 3. Install BeautifulSoup 4 instead: `pip3.3 install beautifulsoup4`. – Martijn Pieters Feb 26 '14 at 13:17
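
If installing a third-party package is not an option at all, the same idea (parse the tags, apply the regex only to href values) can be sketched with the standard library's `html.parser`. This is an illustrative fallback under stated assumptions, not something proposed in the thread: `LinkCollector`, `LINK_RE`, and `source` are hypothetical names, and `source` reuses the markup from the question.

```python
import re
from html.parser import HTMLParser

# Hypothetical pattern name; dots and "?" escaped so they match literally.
LINK_RE = re.compile(r'https://www\.here\.com/.*\?ref=br_rs')


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags whose href matches LINK_RE."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, and value may be None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and LINK_RE.fullmatch(value):
                self.links.append(value)


# Assumed input: the sample markup from the question.
source = (
    '<a class="_5syj" href="https://www.here.com/FirstCal?ref=br_rs">First Cal</a>'
    '<span class="mls _1ccm9 _49"></span>'
    '<a class="_fasc" href="https://www.here.com/Mall?ref=br_rs">Mall</a>'
    '<span class="m1ls _1cm9 _49"></span>'
)

collector = LinkCollector()
collector.feed(source)
print(collector.links)
```

Because only the tag name and the href value are inspected, changing class names or ids on the page do not affect the result.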