Extract href from tag with class using Regex

Question

I need to read many pages from a website and extract all links with class "active" using a regex. These tags can have the class attribute BEFORE or AFTER the href value.

My code is:

    import re
    import requests

    try:
        p = requests.get(url, timeout=4.0)
    except requests.RequestException:
        p = None  # network error or timeout
    if p is not None and p.status_code < 400 and p.text:
        # p.text is the decoded body (str); p.content is bytes
        canonical_url = re.search(r'<a class="active" href="([^"]*)"', p.text, flags=re.IGNORECASE)

But with this regex I only catch links where the class comes BEFORE the href, not AFTER. Thanks.

Looks like python, you should tag this [tag:python]. Also, don't use regex. [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) Use BeautifulSoup. — ctwheels, Feb 07 '18 at 16:56
Trying to understand why you would need to use a regex. Could you give us more context around that, please? — Haleemur Ali, Feb 07 '18 at 17:07
If it's a single site, and the structure is known (and regular), pulling out hrefs by using a regular expression is perfectly fine. — KyleFairns, Feb 07 '18 at 17:12
I used BS4, but my boss asked me to use regex because BS4 is an overkill to extract a simple link. :) — Paul Iulius, Feb 07 '18 at 17:15

ctwheels · Answer 1 · 2018-02-07T17:29:35.310

Given that the OP specified the following in the comments below the question, regex may be used:

"I used BS4, but my boss asked me to use regex because BS4 is an overkill to extract a simple link"

Be careful, though: regex can easily break when trying to parse HTML.

The regex:

    <a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))

<a: Match this literally
\b: Assert position as a word boundary
(?=[^>]* class="[^"]*(?<=[" ])active[" ]): Positive lookahead ensuring what follows matches
- [^>]*: Match any character except > any number of times
- class=": Match this literally
- [^"]*: Match any character except " any number of times
- (?<=[" ]): Positive lookbehind ensuring what precedes is a character in the set
- active: Match this literally
- [" ]: Match either character in the set
(?=[^>]* href="([^"]*)): Positive lookahead ensuring what follows matches
- [^>]*: Match any character except > any number of times
- href=": Match this literally
- ([^"]*): Capture any character except " any number of times into capture group 1

Given the following samples, only the first 3 are matched:

    <a class="active" href="something">
    <a href="something" class="active">
    <a href="something" class="another-class active some-other-class">

    <a class="inactive" href="something">
    <a not-class="active" href="something">
    <a class="active" not-href="something">
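
For reference, here is a minimal Python sketch of how the pattern could be applied to the samples above. The helper name extract_active_hrefs and the samples list are made up for illustration and are not part of the OP's code. The last call shows one way the pattern can break: it assumes double-quoted attributes, so a single-quoted tag is missed.

```python
import re

# The pattern from the answer, written as a raw string so the backslash in \b
# and the double quotes need no escaping.
ACTIVE_LINK = re.compile(
    r'<a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))'
)


def extract_active_hrefs(html):
    """Return the href of every <a> tag whose class list contains "active".

    Hypothetical helper name; findall returns capture group 1 for each match.
    """
    return ACTIVE_LINK.findall(html)


samples = [
    '<a class="active" href="something">',
    '<a href="something" class="active">',
    '<a href="something" class="another-class active some-other-class">',
    '<a class="inactive" href="something">',
    '<a not-class="active" href="something">',
    '<a class="active" not-href="something">',
]

# The first three tags yield ['something']; the last three yield [].
for tag in samples:
    print(tag, '->', extract_active_hrefs(tag))

# One way the pattern breaks: it only recognises double-quoted attributes,
# so a single-quoted tag is silently skipped.
print(extract_active_hrefs("<a class='active' href='something'>"))  # → []
```

If the pages may contain uppercase tags or attribute names, passing re.IGNORECASE to re.compile (as in the question's code) should cover that.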

Works perfectly! For Python I changed it to <a\b(?=[^>]* class=\"[^\"]*(?<=[\" ])active[\" ])(?=[^>]* href=\"([^\"]*)) — Paul Iulius, Feb 07 '18 at 18:14
