Finding links in HTML using regex

Question

I'm trying to find all links in a Wikipedia article while excluding fragments (links starting with #).

Initially I was using <a href=\"[^#]\S*?\", which worked fine (what it captures is a bit messy, but I can clean that up later in Python). But then I realized that "<a " isn't necessarily followed directly by "href", so I changed the expression to

<a .*?href=\"[^#]\S*?\"

My thinking was: capture text starting with '<a ', followed by any characters zero or more times until 'href="' is reached, then one character that is not '#', followed by zero or more non-whitespace characters until a quote (") is reached.

Both of these are now captured, which is what I want:

<a title="test" href="link"

<a href="link"

And this is not captured, which is also what I want:

<a class="class1" href="#fragment">

But this is captured, which I do not want:

<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"

Why does this happen?

Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) - Christian Baumann, Oct 19 '20 at 11:01
Nice. Unfortunately, this is part of an assignment in one of my courses, and the task is literally to find links in HTML using regex - Jonas, Oct 19 '20 at 11:03
A course is making you parse HTML with regex? Are they at least making it very clear that the HTML sample will be extremely consistent and shallowly nested? It is not possible to parse HTML reliably with regex in the general case. For specific and consistent cases it *is possible*, but highly discouraged - Chase, Oct 19 '20 at 11:18
Nothing has been said or written about this generally being a bad idea. But now I know, thanks for making it clear! - Jonas, Oct 19 '20 at 11:33

score 1 · Accepted Answer · answered Oct 19 '20 at 12:05

With ., you're matching all characters, including the closing >.

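The non-greedy modifier in .*? means the engine tries the shortest match first, so it stops before the > if the rest of the pattern matches there. If it doesn't (here because the first href starts with #), .*? keeps expanding, past the >, until a match is found.

The same goes for \S, which matches any non-whitespace character, including a closing ".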
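To see the backtracking in action, here is a minimal sketch in Python (the language the question mentions). The input string is a shortened stand-in for the asker's problematic line, and the names lazy, html and match are just illustrative.

```python
import re

# The asker's second pattern, copied from the question.
lazy = re.compile(r'<a .*?href="[^#]\S*?"')

# Shortened version of the problematic input: a fragment link, then a real link.
html = '<a href="#cite"></a></sup><a href="/wiki/Filename_extension">'

match = lazy.search(html)

# The first href is rejected by [^#], so .*? keeps expanding, crosses the ">"
# and the tags after it, and settles on the href of the second <a> instead.
print(match.start())   # where the match begins
print(match.group(0))  # the full matched text
```

The match begins at the very first "<a " and only ends at the closing quote of the second href, which is exactly the unwanted capture from the question.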

You should explicitly exclude the characters that must not be matched, rather than relying on non-greedy quantifiers:

<a\s[^>]*\bhref="([^#"][^"]*)"
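As a rough check, the pattern can be run against the samples from the question with Python's re module. This is only a sketch: LINK_RE, samples and results are names made up for illustration, and the last sample is shortened from the original.

```python
import re

# The pattern from this answer; group 1 holds the link target.
LINK_RE = re.compile(r'<a\s[^>]*\bhref="([^#"][^"]*)"')

samples = [
    '<a title="test" href="link"',
    '<a href="link"',
    '<a class="class1" href="#fragment">',
    # Shortened version of the line that was wrongly captured before.
    '<a href="#citewnotew1"></a></sup><a href="/wiki/Filename_extension"',
]

# Search each sample on its own and collect the captured targets.
results = [LINK_RE.findall(sample) for sample in samples]

for sample, found in zip(samples, results):
    print(found, "<-", sample)
```

The samples are searched one at a time on purpose. The first two have no closing >, so if they were joined into one string, [^>]* could run from one unterminated tag into the next and merge them into a single match.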

Explanation

<a matches the characters <a literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
[^>]* matches a single character not present in the list below
- * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- > matches the character > literally (case sensitive)
\b asserts position at a word boundary: (^\w|\w$|\W\w|\w\W)
href=" matches the characters href=" literally (case sensitive)
1st Capturing Group ([^#"][^"]*)
- [^#"] matches a single character not present in the list below
  - #" matches a single character in the list #" (case sensitive)
- [^"]* matches a single character not present in the list below
  - * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - " matches the character " literally (case sensitive)
" matches the character " literally (case sensitive)

This won't properly match all cases in HTML. As the OP stated, this is just an exercise in regular expressions.
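If the pattern ends up in a Python script, the explanation can be kept next to it by using re.VERBOSE, which ignores whitespace in the pattern and treats # outside a character class as the start of a comment. The following is a sketch of the same expression in that form; the name LINK_RE and the two test strings are assumptions for illustration.

```python
import re

LINK_RE = re.compile(r'''
    <a\s          # literal <a followed by one whitespace character
    [^>]*         # anything except >, so the match stays inside this tag
    \bhref="      # href at a word boundary, then the opening quote
    (             # group 1: the link target
      [^#"]       #   first character: not a fragment marker, not a quote
      [^"]*       #   rest of the value, up to the closing quote
    )
    "             # the closing quote
''', re.VERBOSE)

# Illustrative inputs, not taken from the question.
normal = LINK_RE.findall('<a class="c" href="/wiki/Regex">')
fragment = LINK_RE.findall('<a href="#top">')
print(normal, fragment)
```

The # inside [^#"] is safe because verbose mode does not treat it as a comment start within a character class.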

score 0 · Answer 2 · answered Oct 19 '20 at 13:42

Try this. Instead of ., it matches only characters that are neither the end of the tag nor the beginning of a new tag.

<a [^\<\>]*href\=\"[^\#][^\"]*?\"
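A quick way to try this pattern is a short Python sketch like the one below. TAG_RE and both input strings are made up for illustration, and the second input is a hypothetical edge case rather than something from the question.

```python
import re

# Pattern from this answer. The backslashes before <, >, =, " and # are
# redundant in Python's re syntax, but harmless.
TAG_RE = re.compile(r'<a [^\<\>]*href\=\"[^\#][^\"]*?\"')

# A fragment link followed by a real link: only the real link matches.
fragment_then_link = '<a href="#cite"></a></sup><a href="/wiki/Filename_extension">'
matches = TAG_RE.findall(fragment_then_link)

# Hypothetical edge case: with an empty href, [^\#] consumes the closing
# quote, and the lazy [^\"]*? then runs on to the next quote in the input.
empty_href = '<a href=""><b class="x">'
overrun = TAG_RE.findall(empty_href)

print(matches)
print(overrun)
```

Since the pattern has no capturing group, findall returns the whole matched text rather than just the link target. The empty href case suggests that excluding the quote as well, as in [^#"], may be the safer choice for the first character.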

Steve Tomlin
