
I want to write a regex that matches links in HTML code, but only links that wrap an image. An example explains it better:

<a href="I NEED THIS1">  <img src="I NEED THIS2">  </a>  <a href="I DONT
NEED THIS" title="something">  </a>   <a href="I NEED THIS3" title="blah">
<figure> <img src="I NEED THIS4" alt="">   </figure>  </a>

I tried something like this, but it matches I DONT NEED THIS instead of I NEED THIS3.

<a href="([^"]*)"\s*.*?<img src="(.*?)".*?\s*<\/a>

I tried to add a negative lookahead, (?!</a>), but no matter where I put it, the regex behaves as if it weren't there. I'm not sure I understand negative lookahead correctly.

I also tried a regex that finds words near each other. It works, but it is not a very elegant solution :) It matches href and img src when they are between 0 and 7 words apart:

<a href="([^"]*)"\W+(?:\w+\W+){0,7}?<img src="(.*?)".*?\s*<\/a>

The regex will be used in Excel VBA; I have been testing it on online regex tester websites.
Any suggestions would be helpful.

vlayausa

2 Answers


Use the MSHTML parser:

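' htmlstr is assumed to hold the HTML source as a String
Dim odoc As Object: Set odoc = CreateObject("htmlfile")
Dim element As Object
odoc.Open
odoc.Write htmlstr
odoc.Close

' src of every <img>
For Each element In odoc.images
    MsgBox element.src
Next

' href of every <a>
For Each element In odoc.getElementsByTagName("a")
    MsgBox element.href
Next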

You may need to remove a leading "about:" prefix, which shows up when relative URLs are resolved against the document's about:blank location.
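The loops above list every image and every link separately. If only links that wrap an image are wanted, a rough sketch along these lines might work. The names AnchorsWithImages and StripAbout are hypothetical helpers, not part of MSHTML, and the "href|src" output format is just an assumption for illustration:

```vba
Option Explicit

' Hypothetical helper: drops a leading "about:" from a resolved URL, if present.
Function StripAbout(ByVal url As String) As String
    If Left$(url, 6) = "about:" Then
        StripAbout = Mid$(url, 7)
    Else
        StripAbout = url
    End If
End Function

' Hypothetical helper: returns one "href|src" string for each <a>
' that has at least one <img> somewhere inside it.
Function AnchorsWithImages(ByVal htmlstr As String) As Collection
    Dim odoc As Object, anchor As Object, imgs As Object
    Dim result As New Collection

    Set odoc = CreateObject("htmlfile")
    odoc.Open
    odoc.Write htmlstr
    odoc.Close

    For Each anchor In odoc.getElementsByTagName("a")
        Set imgs = anchor.getElementsByTagName("img")
        If imgs.Length > 0 Then   ' anchors without an image are skipped
            ' Only the first image inside the anchor is reported
            result.Add StripAbout(anchor.href) & "|" & StripAbout(imgs.Item(0).src)
        End If
    Next anchor

    Set AnchorsWithImages = result
End Function
```

Calling AnchorsWithImages(htmlstr) and looping over the returned Collection with Debug.Print should show the pairs. The parser may also percent-encode characters such as spaces in the returned URLs, so the values are not guaranteed to be byte-for-byte what the attributes contain.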

Alex K.
  • This is correct according to the famous advice at http://stackoverflow.com/a/1732454/122139 that has helped thousands. – Smandoli May 26 '16 at 13:00

Here's another solution.

(href="([^"]+).*(?=img src))|(img src="([^"]*))
  1. check for href="
  2. return everything before the next " -> first group you're interested in
  3. but only if there is img src following (positive lookahead)
  4. check for img src="
  5. return everything before the next " -> second group you're interested in

Demo: https://regex101.com/r/yS9bB4/1
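Since the question mentions Excel VBA and a negative lookahead, here is a minimal sketch of how a lookahead-based pattern could be run with the VBScript.RegExp object. This is a different pattern from the one above: it assumes a "tempered" lookahead, (?:(?!</a>)[\s\S])*?, which consumes one character at a time only while the next characters are not </a>, so a match cannot run past the end of the current link. The function name LinkedImagePairs and the "href|src" output format are hypothetical, and the pattern assumes double-quoted attributes and no nested links:

```vba
Option Explicit

' Hypothetical helper: returns one "href|src" string for each <a ...>
' whose content reaches an <img ...> before the closing </a>.
Function LinkedImagePairs(ByVal htmlstr As String) As Collection
    Dim re As Object, m As Object
    Dim result As New Collection

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True        ' find all matches, not only the first
    re.IgnoreCase = True
    ' Group 1: href value. Group 2: src value.
    ' (?:(?!</a>)[\s\S])*? steps forward lazily but never across </a>.
    re.Pattern = "<a\s[^>]*href=""([^""]*)""[^>]*>" & _
                 "(?:(?!</a>)[\s\S])*?" & _
                 "<img\s[^>]*src=""([^""]*)"""

    For Each m In re.Execute(htmlstr)
        result.Add m.SubMatches(0) & "|" & m.SubMatches(1)
    Next m

    Set LinkedImagePairs = result
End Function
```

The lookahead has to sit inside the repeated group. Placed once before or after a plain .*?, it is only checked at a single position, which may be why adding (?!</a>) seemed to have no effect.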

tk78