I'm trying to find all links in a Wikipedia article while excluding fragments (links starting with #).
Initially I was using <a href=\"[^#]\S*?\"
which worked fine (although what it captures is a bit messy, I can clean that up later in Python). But then I realized that "<a " isn't necessarily followed directly by "href", so I changed the expression to
<a .*?href=\"[^#]\S*?\"
My thinking was: match text starting with '<a ', followed by any characters zero or more times until 'href="' is reached, then one character that is not '#', followed by zero or more non-whitespace characters until a quote (") is reached.
Both of these are now matched, which is what I want:
<a title="test" href="link"
<a href="link"
And this is not matched, which is also what I want:
<a class="class1" href="#fragment">
But this is matched as a whole, which I do not want:
<a href="#citewnotew1"></a></sup></div></td></tr><tr><th scope="row" style="line-height:1.2em; padding-right:0.65em;"><a href="/wiki/Filename_extension"
Why does this happen?
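For reference, the four cases above can be reproduced with a short Python sketch. It assumes the standard `re` module and the second pattern written as a raw string; the `samples` dictionary and the `first_match` helper are names made up for this illustration, not part of my actual script.

```python
import re

# Second pattern from the question, written as a raw string
# (no backslash is needed before the quotes in this form).
pattern = re.compile(r'<a .*?href="[^#]\S*?"')

# Hypothetical sample inputs, copied from the cases listed above.
samples = {
    "title_then_href": '<a title="test" href="link"',
    "href_only": '<a href="link"',
    "fragment": '<a class="class1" href="#fragment">',
    "fragment_then_link": (
        '<a href="#citewnotew1"></a></sup></div></td></tr><tr>'
        '<th scope="row" style="line-height:1.2em; padding-right:0.65em;">'
        '<a href="/wiki/Filename_extension"'
    ),
}


def first_match(text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


for name, text in samples.items():
    print(name, "->", first_match(text))
```

In the last case the match begins at the first '<a ' (the one whose href is the fragment) and runs all the way through to the closing quote of the second href, so the whole sample string comes back as a single match.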