I'm brand new to regular expressions, and I'm trying to solve the following two problems:
Write a regular expression that extracts all the links and the corresponding link text from an HTML page. For example, if you wanted to parse:
text1 <a href="http://example.com">hello, world</a> text2
and get the result
http://example.com <tab> hello, world
Do the same thing, but also handle cases where <...> are nested:
text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3
So far I'm still on the first problem, and I've tried several approaches. My best attempt so far is the regex (?<=a href=\")(.*)(?=</a>)
which gives me: http://example.com">hello, world
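For reference, here is a minimal Python sketch that reproduces this attempt. Python's `re` module is only an assumption here, since the problem statement doesn't name a regex flavor, and the variable names are just for illustration. The backslash before the quote is dropped because the pattern sits in a single-quoted raw string.

```python
import re

# Sample input from the first problem.
html = 'text1 <a href="http://example.com">hello, world</a> text2'

# The attempt above: a lookbehind for `a href="`, a greedy capture,
# and a lookahead for the closing `</a>`.
pattern = re.compile(r'(?<=a href=")(.*)(?=</a>)')

match = pattern.search(html)
result = match.group(1) if match else None
print(result)  # http://example.com">hello, world
```

The URL and the link text come back as one string, still joined by the `">` that closes the opening tag, rather than as two separate fields.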
This seems good enough to me, but I don't know how I'm supposed to approach the second part. Any help or insight would be greatly appreciated.
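To make the second part concrete, here is a rough sketch of where a simple two-group pattern breaks on the nested input, next to one possible variant that treats quoted attribute values as single units. This is only a sketch under assumptions: it uses Python's `re` module, it assumes `href` is the first attribute and is double-quoted, and the names `NAIVE`, `QUOTE_AWARE`, and `extract_links` are hypothetical. It is not a general HTML parser.

```python
import re

simple = 'text1 <a href="http://example.com">hello, world</a> text2'
nested = '''text1 <a href="http://example.com" onclick="javascript:alert('<b>text2</b>')">hello, world</a> text3'''

# Naive pattern: assumes the opening tag ends at the first '>'.
NAIVE = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>')

# Quote-aware variant: the rest of the opening tag is a run of
# double-quoted strings, single-quoted strings, or other characters
# that are neither quotes nor '>'. A '>' inside quotes is skipped over.
QUOTE_AWARE = re.compile(
    r'''<a\s+href="([^"]*)"(?:"[^"]*"|'[^']*'|[^>"'])*>(.*?)</a>'''
)

def extract_links(html, pattern=QUOTE_AWARE):
    """Return one 'url<TAB>link text' string per matched link."""
    return [f"{url}\t{text}" for url, text in pattern.findall(html)]

print(extract_links(simple))
print(extract_links(nested))
print(extract_links(nested, NAIVE))  # the naive pattern stops inside onclick
```

With the naive pattern, the tag is cut off at the `>` of `<b>` inside the `onclick` value, so the captured text starts in the middle of the attribute. The quote-aware variant only ends the tag at a `>` that falls outside any quoted value.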