I am currently working on a project where I need to match specific HTML tags and replace them with others.

I am using JavaScript to do so, and the code looks like this:

// html to update
let html = '<div class="page-embed"><article><iframe src="https://alink" width="100%"></iframe></article></div>';

// regex that should match the specific <div class="page-embed">...<iframe src="https://alink"></iframe>...</div> structure
const regexIframeInsideDiv = /<\s*div\s*class="page-embed"[^>]*>.*?<\s*iframe[^>]*\s*src="(.*?)"\s*><\s*\/\s*iframe\s*>.*?<\s*\/\s*div\s*>/g;

html = html.replace(regexIframeInsideDiv, (_match, src) => {
  console.log(src);
  return `<oembed>${src}</oembed>`;
});

I use a capture group, ( ), to get the value of the src attribute, as follows:

src="(.*?)"

Here is the problem:

If I run the code, the console logs:

https://alink" width="100%

whereas it should log:

https://alink

I might be missing something, such as an escaped character or some other mistake, but I don't know what.

Here is the expected behaviour: https://regexr.com/4tbj6

Thank you!

alevani
  • Parsing HTML with regex is notoriously difficult; there is [a famous humorous answer](https://stackoverflow.com/a/1732454/157957) advising to never attempt it. That doesn't mean there's no answer to your particular question, but it's a good idea to think about the future direction of your code, and whether a non-regex approach will be more suitable in the long run. – IMSoP Jan 31 '20 at 12:36

2 Answers

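In the part of your regex that matches src, the closing quote should be followed by \s.* rather than \s*, so that any attributes after src are consumed before the closing >: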

src="(.*?)"\s.*>

// html to update
let html = '<div class="page-embed"><article><iframe src="https://alink" width="100%"></iframe></article></div>';

// regex that matches the specific <div class="page-embed">...<iframe src="https://alink"></iframe>...</div> structure
const regexIframeInsideDiv = /<\s*div\s*class="page-embed"[^>]*>.*?<\s*iframe[^>]*\s*src="(.*?)"\s.*><\s*\/\s*iframe\s*>.*?<\s*\/\s*div\s*>/g;

html = html.replace(regexIframeInsideDiv, (_match, src) => {
  console.log(src);
  return `<oembed>${src}</oembed>`;
});
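
Note that the \s.* form requires at least one whitespace character after the closing quote of src, so it would not match an iframe where src is the last attribute. A slightly more defensive variant can be sketched as follows, assuming attribute values are double-quoted and never contain " or > themselves. The second sample string (html2) is made up for illustration and is not from the question.

```javascript
// Sample input from the question, plus a hypothetical one where src comes last.
const html1 = '<div class="page-embed"><article><iframe src="https://alink" width="100%"></iframe></article></div>';
const html2 = '<div class="page-embed"><iframe width="100%" src="https://blink"></iframe></div>';

// ([^"]*) cannot run past the closing quote of src, and the following [^>]*
// skips any remaining attributes without leaving the <iframe> tag.
const regexIframeInsideDiv =
  /<\s*div\s+class="page-embed"[^>]*>.*?<\s*iframe\b[^>]*\ssrc="([^"]*)"[^>]*>\s*<\s*\/\s*iframe\s*>.*?<\s*\/\s*div\s*>/g;

const toOembed = (input) =>
  input.replace(regexIframeInsideDiv, (_match, src) => `<oembed>${src}</oembed>`);

const result1 = toOembed(html1);
const result2 = toOembed(html2);

console.log(result1); // <oembed>https://alink</oembed>
console.log(result2); // <oembed>https://blink</oembed>
```

Using negated character classes instead of .*? or .* keeps each part of the pattern confined to its own attribute or tag, which is usually easier to reason about than relying on backtracking.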
Orelsanpls
  • Thank you, it works fine now. Could you explain why this worked on regexr but not in JS? – alevani Jan 31 '20 at 13:33
  • It didn't. Try putting the whole regex in the `Expression` input, then open the `Details` panel at the bottom right (under Tools). You will see that the matches are the same as in JavaScript. The problem was not with `src="(.*?)"` but with what came after it. – Orelsanpls Jan 31 '20 at 13:38
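
To illustrate the point about "what came after": a lazy group stops at the first position where the rest of the pattern can match, not at the first quote. The snippet below is a minimal sketch using a reduced version of the original pattern; the names reducedPattern and tag are only for illustration.

```javascript
// Reduced version of the original pattern: the closing quote of src must be
// followed, after optional whitespace, by ">".
const reducedPattern = /src="(.*?)"\s*>/;
const tag = '<iframe src="https://alink" width="100%">';

// The first quote after https://alink is followed by " width", not by ">",
// so the lazy group keeps growing until it reaches a quote that sits right
// before ">".
const captured = tag.match(reducedPattern)[1];

console.log(captured); // https://alink" width="100%
```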

Try this RegEx:

(?<=(<div class="page-embed".+iframe src="))(.*?)(?=")

This searches for the string between src=" and the next " inside a div with your class that contains an iframe.
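
A short usage sketch, assuming a JavaScript engine that supports lookbehind assertions (an ES2018 feature). The variable names and the https://other replacement URL are made up for illustration. Because the lookarounds do not consume any text, the match is only the URL, so this pattern suits extracting the value rather than replacing the whole div.

```javascript
const html = '<div class="page-embed"><article><iframe src="https://alink" width="100%"></iframe></article></div>';

// The lookbehind and lookahead only assert the surrounding context,
// so the overall match is the URL alone.
const regexSrcOnly = /(?<=(<div class="page-embed".+iframe src="))(.*?)(?=")/;

const found = html.match(regexSrcOnly);
const src = found ? found[0] : null;
console.log(src); // https://alink

// With replace(), only the URL is swapped; the <div> and <iframe> markup stays.
const replaced = html.replace(regexSrcOnly, 'https://other');
console.log(replaced);
```

To produce the <oembed> element from the question, the extracted src would still have to be wrapped separately, for example with a template literal.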

PaulS