I need to read many pages from a website and extract all links with class "active" using a regex. These tags can have the class attribute before or after the href attribute.

My code is:

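    import re
    import requests

    try:
        p = requests.get(url, timeout=4.0)
    except requests.RequestException:
        p = None
    if p is not None and p.text and p.status_code < 400:
        # p.text is the decoded body (str); on Python 3, p.content is bytes and cannot be searched with a str pattern
        canonical_url = re.search('<a class="active" href="(.*)?"', p.text, flags=re.MULTILINE|re.IGNORECASE|re.DOTALL|re.UNICODE)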

But this regex only catches links where the class attribute comes before the href, not after it. Thanks.

  • Looks like Python, you should tag this [tag:python]. Also, don't use regex. [HE COMES](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) Use BeautifulSoup. – ctwheels Feb 07 '18 at 16:56
  • Thanks for your suggestion, but I need to use a regex. – Paul Iulius Feb 07 '18 at 17:06
  • Is there a reason why? Also, change `(.*)?` to `([^"]*)` – ctwheels Feb 07 '18 at 17:07
  • I'm trying to understand why you would need to use a regex. Could you give us more context, please? – Haleemur Ali Feb 07 '18 at 17:07
  • If it's a single site, and the structure is known (and regular), pulling out hrefs by using a regular expression is perfectly fine. – KyleFairns Feb 07 '18 at 17:12
  • I used BS4, but my boss asked me to use regex because BS4 is overkill for extracting a simple link. :) – Paul Iulius Feb 07 '18 at 17:15

1 Answer

Since the OP specified the following in a comment below the question, regex may be used. Be careful, though: regex can easily break when used to parse HTML.

> I used BS4, but my boss asked me to use regex because BS4 is an overkill to extract a simple link

See regex in use here

    <a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))

  • <a Match this literally
  • \b Assert position as a word boundary
  • (?=[^>]* class="[^"]*(?<=[" ])active[" ]) Positive lookahead ensuring the following is matched.
    • [^>]* Match any character except > any number of times
    • class=" Match this literally
    • [^"]* Match any character except " any number of times
    • (?<=[" ]) Positive lookbehind ensuring what precedes is a character in the set
    • active Match this literally
    • [" ] Match either character in the set
  • (?=[^>]* href="([^"]*)) Positive lookahead ensuring what follows matches
    • [^>]* Match any character except > any number of times
    • href=" Match this literally
    • ([^"]*) Capture any character except " any number of times into capture group 1
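
The breakdown above can be written out as an annotated pattern. This is only a sketch: the names `ACTIVE_LINK` and `find_active_hrefs` are made up for illustration, and it assumes the HTML is already a decoded `str` (for example `p.text` from `requests`) and that attribute values are double-quoted.

```python
import re

# The same pattern, spread over several lines with re.VERBOSE.
# Literal spaces are written as [ ] because verbose mode ignores
# whitespace outside character classes.
ACTIVE_LINK = re.compile(
    r'''
    <a\b                    # "<a" followed by a word boundary
    (?=                     # lookahead 1: class attribute containing "active"
        [^>]*[ ]class="     #   anything inside the tag, then a space and class="
        [^"]*               #   earlier class names, if any
        (?<=[" ])active     #   "active" preceded by a quote or a space
        [" ]                #   and followed by a quote or a space
    )
    (?=                     # lookahead 2: href attribute
        [^>]*[ ]href="      #   anything inside the tag, then a space and href="
        ([^"]*)             #   group 1: the href value
    )
    ''',
    re.VERBOSE,
)


def find_active_hrefs(html):
    """Return the href of every <a> tag whose class list contains "active"."""
    return [m.group(1) for m in ACTIVE_LINK.finditer(html)]


html = '<p><a href="/home" class="nav active">Home</a> <a class="nav" href="/about">About</a></p>'
print(find_active_hrefs(html))  # ['/home']
```

Both lookaheads start from the same position right after `<a`, which is why the order of `class` and `href` inside the tag does not matter.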

Given the following samples, only the first 3 are matched:

    <a class="active" href="something">
    <a href="something" class="active">
    <a href="something" class="another-class active some-other-class">

    <a class="inactive" href="something">
    <a not-class="active" href="something">
    <a class="active" not-href="something">
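
The six samples can be checked from Python with a few lines. This is a quick sanity check rather than a full test suite; the variable names `pattern`, `samples`, and `results` are placeholders chosen for this sketch.

```python
import re

pattern = re.compile(
    r'<a\b(?=[^>]* class="[^"]*(?<=[" ])active[" ])(?=[^>]* href="([^"]*))'
)

samples = [
    '<a class="active" href="something">',
    '<a href="something" class="active">',
    '<a href="something" class="another-class active some-other-class">',
    '<a class="inactive" href="something">',
    '<a not-class="active" href="something">',
    '<a class="active" not-href="something">',
]

# Collect the captured href for each sample, or None when there is no match.
results = []
for tag in samples:
    m = pattern.search(tag)
    results.append(m.group(1) if m else None)
    print(tag, '->', results[-1])
```

The first three samples print `something`; the last three print `None`.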
ctwheels