Regex to extract HTML, leave text

Question

I have this piece of HTML:

<div class="embed">
<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>
Some text I don't want
</div>

This is how it is being inserted into the HTML:

<div class="embed"><?php echo $item['embed_html']; ?></div>

This is what $item['embed_html'] is echoing out:

<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>Some text I don't want

So I don't want to parse the whole document, just this specific string.

Before anyone points out the security issues with allowing raw code onto a page: don't worry, this HTML is not entered by outside users.

I need to extract the HTML but leave the text (so it would look like this):

<div class="embed">
<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456"></iframe>
</div>

There are several different embed codes, so what I'm really asking is: what is the best way to remove text that is not wrapped in an HTML element (that is, not between < and >)? Tags such as <img, <p, <div, <iframe, <object, <embed and <video may all be used in this section. If any text is added that is not wrapped in a tag, it should be removed from the string.

I don't want to wrap the offending text in a tag; I want to remove it completely. In a way, this is the reverse of strip_tags().

Possible duplicate of [Extract parts of html using regex](http://stackoverflow.com/questions/2693678/extract-parts-of-html-using-regex) and dozens of others. Also, using a regex to parse HTML/XML or anything else that has its own parser is almost always a bad idea, and typically causes more problems than it solves. – Ken White, Nov 16 '11 at 12:26

Regexident · Accepted Answer · 2011-11-16T12:27:30.617

3

This is a simple regex that would do what you want in 99% of cases:

<[^>]+>

All it does, though, is match XML/HTML tags. That's it. There's no clean way of telling it to only match text inside the DOM subtree of a certain node (such as <div class="embed">). For that you would need to use a context-free parser, such as a DOM parser.

Your sample input would be matched into:

{
    "<div class="embed">",
    "<iframe width="300" height="200" frameborder="0" allowfullscreen="" src="http://www.youtube.com/embed/123456">",
    "</iframe>",
    "</div>"
}
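In PHP, the matching above could be sketched with preg_match_all, gluing the matched tags back together so that everything between them is dropped. The helper name keepTagsOnly is hypothetical and only used for illustration; the input string is the one from the question.

```php
<?php
// Hypothetical helper: keeps only the tag tokens matched by <[^>]+>
// and discards everything between them.
function keepTagsOnly(string $html): string
{
    // preg_match_all stores every full match, in order, in $matches[0].
    preg_match_all('/<[^>]+>/', $html, $matches);

    return implode('', $matches[0]);
}

$embed = '<iframe width="300" height="200" frameborder="0" allowfullscreen="" '
       . 'src="http://www.youtube.com/embed/123456"></iframe>Some text I don\'t want';

echo keepTagsOnly($embed), "\n";
```

Note that this drops all text, including text nested inside an element, so <p>hello</p> would come back as <p></p>. That is fine for embeds like the iframe above, but probably not for a <p> or <div> with real content.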

Given input text in which a tag such as <foo> sits inside an HTML comment, however, you would end up with <foo> being extracted despite it technically being commented out. Removing all occurrences of comments with a regex beforehand should solve that, though.
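A comment-stripping pre-pass might look roughly like the sketch below. The pattern <!--.*?--> and the helper name stripComments are assumptions made for illustration, not necessarily what the answer had in mind, and the sample input is made up.

```php
<?php
// Hypothetical helper: removes HTML comments before any tag matching is done.
function stripComments(string $html): string
{
    // The s modifier lets . match newlines, so multi-line comments are covered.
    // .*? is lazy, so each comment ends at its first closing -->.
    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace('/<!--.*?-->/s', '', $html) ?? $html;
}

$input = 'before <!-- <foo> --> <bar> after';

// Run the tag regex only on the comment-free string.
preg_match_all('/<[^>]+>/', stripComments($input), $matches);
print_r($matches[0]);
```

This is still a regex working on markup, so unusual input (for example a --> inside a script or an attribute value) can fool it.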

Anyway, in general you're best off using a DOM parser for anything XML/HTML.
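As a rough sketch of the DOM route, PHP's DOMDocument can parse just the embed string, and only the top-level element nodes are written back out. The helper name dropLooseText and the wrapping <div> are assumptions for illustration; the sample string is shortened from the question.

```php
<?php
// Hypothetical helper: keeps the top-level elements of an HTML fragment and
// drops top-level text nodes. Text nested inside an element is left alone.
function dropLooseText(string $fragment): string
{
    // Collect parser warnings internally instead of emitting them
    // (libxml may complain about tags it does not know, such as <video>).
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // Wrap the fragment in a single root element so it can be parsed
    // without <html>/<body> or a doctype being added around it.
    $doc->loadHTML(
        '<div>' . $fragment . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $out = '';
    foreach ($doc->documentElement->childNodes as $node) {
        // Element nodes are kept; text and comment nodes are skipped.
        if ($node instanceof DOMElement) {
            $out .= $doc->saveHTML($node);
        }
    }

    return $out;
}

$embed = '<iframe src="http://www.youtube.com/embed/123456"></iframe>Some text I don\'t want';
echo dropLooseText($embed), "\n";
```

Unlike the plain regex, this leaves text inside elements intact, so a <p> with content keeps its content. Non-ASCII input may need extra encoding handling, which this sketch does not attempt.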

edited Nov 16 '11 at 12:27

answered Nov 16 '11 at 12:14

Regexident


@Gordon: indeed it does, as I mentioned in the answer (ninja update, a minute ago). In it I also gave a recommendation to use a DOM parser instead, if a search scope inside the DOM Tree is required. – Regexident Nov 16 '11 at 12:30

@Gordon, @kieran: No need to have a fight about this, guys. The way I see it, it was a simple case of miswording. What @kieran apparently meant by "text that is not wrapped in an HTML element" was "not between any pair of `<` and `>`". If I'm right in this assumption, then simply replacing "HTML element" with "HTML tag bracket pair" (or the like) should sufficiently fix the miswording/confusion. – Regexident Nov 16 '11 at 15:25
