
I've been struggling with this because of my limited regex experience. I need to extract all img tags in HTML that occur inside p tags. For example, given:

<p>Hello <img src="bbc.co.uk" /> World</p>
<img src="google.com" />
<p>Crazy <img src="google.com"> Town</p>

Should return:

<img src="bbc.co.uk" />
<img src="google.com">

I have this regex so far which captures the img pattern:

<img .+?(?=>)>

However, it captures all img tags, whereas I need only those that appear within p tags, and I do NOT want the p tag itself to be included in the result.

Many thanks

paddyc

4 Answers

If your programming language or tool supports capturing groups with regex, then you can use <p[^>]*>[^<]*(?:<[^>]*>[^<]*)*?(<img[^>]*>)[^<]*(?:<[^>]*>[^<]*)*?</p> to capture just the img tags within p tags.

Using Python as an example:

import re
html = '''<p>Hello <img src="bbc.co.uk" /> World</p>
<img src="stackoverflow.com" />
<p>Crazy <img src="google.com"> Town</p>'''
print(re.findall(r'<p[^>]*>[^<]*(?:<[^>]*>[^<]*)*?(<img[^>]*>)[^<]*(?:<[^>]*>[^<]*)*?</p>', html, re.IGNORECASE | re.DOTALL))

This outputs:

['<img src="bbc.co.uk" />', '<img src="google.com">']
blhsing

Try <p>.*(<img[^>]*>).*<\/p>

Use the inner group ( ... ) to capture the img tag itself, without the surrounding p tag.
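As a rough illustration of how that capture group behaves, here is a minimal Python sketch. The function name find_img_with_greedy_pattern is made up for this example, and the \/ escape is written as a plain / because Python does not need it. It assumes each paragraph sits on its own line, since . does not match newlines by default.

```python
import re

# Same pattern as above; the slash needs no escaping in Python.
PATTERN = re.compile(r'<p>.*(<img[^>]*>).*</p>')

def find_img_with_greedy_pattern(html):
    # With exactly one capture group, findall returns only the group contents.
    return PATTERN.findall(html)

html = '''<p>Hello <img src="bbc.co.uk" /> World</p>
<img src="google.com" />
<p>Crazy <img src="google.com"> Town</p>'''

print(find_img_with_greedy_pattern(html))

# Caveat: the greedy .* keeps only the LAST img when a paragraph has several.
print(find_img_with_greedy_pattern('<p>a <img src="1"> b <img src="2"> c</p>'))
```

On the sample input from the question this returns the two expected tags, but the second call shows the limitation: only the last img in a paragraph is captured, and a p tag with attributes would not match at all.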

mankowitz

You can't. HTML is a context-free language and regular expressions can only denote regular languages.

Edit: You can probably match one image tag, but if you're expecting a variable number of tags, you can't do it with one regular expression.
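One possible workaround for the variable-number case is to use two passes instead of one expression: first isolate each paragraph, then search inside it. The sketch below is only an approximation under stated assumptions: every p has an explicit closing tag, p tags are not nested, and no attribute value contains a > character. The names P_BLOCK, IMG_TAG and imgs_inside_paragraphs are invented for this example.

```python
import re

# Pass 1: the body of each <p ...>...</p>; \b keeps <pre> and similar from matching.
P_BLOCK = re.compile(r'<p\b[^>]*>(.*?)</p>', re.IGNORECASE | re.DOTALL)
# Pass 2: every img tag inside one paragraph body.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

def imgs_inside_paragraphs(html):
    found = []
    for body in P_BLOCK.findall(html):
        found.extend(IMG_TAG.findall(body))
    return found

sample = '<p>a <img src="1"> b <img src="2" /></p><img src="3"><p>c</p>'
print(imgs_inside_paragraphs(sample))
```

This does not make the regex approach robust against arbitrary HTML; it only sidesteps the one-capture-per-match limit for simple, well-formed input.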

CSharpFiasco

Instead of parsing HTML with a regex, which is not advisable, you might use a DOMParser.

let parser = new DOMParser();
let html = `<p>Hello <img src="bbc.co.uk" /> World</p>
<img src="google.com" />
<p>Crazy <img src="google.com"> Town</p>`;
let doc = parser.parseFromString(html, "text/html");
let imgs = doc.querySelectorAll("p img");
imgs.forEach((img) => {
  console.log(img.outerHTML)
});
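If you are working outside the browser, a similar idea can be sketched in Python with the standard library's html.parser. This is not a port of DOMParser: HTMLParser only emits tag events and does not build a tree or auto-close elements, so the sketch simply tracks whether it is currently inside a p and assumes every p is explicitly closed. The class name ParagraphImgCollector and the helper imgs_in_paragraphs are made up for this example.

```python
from html.parser import HTMLParser

class ParagraphImgCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside_p = False   # True between <p> and </p>
        self.imgs = []          # raw text of each img start tag found inside a p

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "p":
            self.inside_p = True
        elif tag == "img" and self.inside_p:
            self.imgs.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.inside_p = False

def imgs_in_paragraphs(html):
    parser = ParagraphImgCollector()
    parser.feed(html)
    parser.close()
    return parser.imgs

html = '''<p>Hello <img src="bbc.co.uk" /> World</p>
<img src="google.com" />
<p>Crazy <img src="google.com"> Town</p>'''

print(imgs_in_paragraphs(html))
```

For real-world markup with unclosed or malformed tags, a tree-building parser would be a safer choice than this event-based sketch.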
The fourth bird