RegEx: Matching text that is not inside and part of a HTML tag

Question

How can I match all content that is outside of an HTML tag?

My pseudo-HTML is:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

I used this regular expression:

(?<=^|>)[^><]+?(?=<|$)

which gives me: "aaa bbb ccc ddd"

All I need is a way to also ignore text enclosed in HTML tags, so that it returns: "bbb ccc"
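
To make the problem concrete, here is a small JavaScript sketch that runs the expression above against the sample input. The helper name `textBetweenTags` is made up for illustration, and the sketch assumes a runtime with lookbehind and `String.prototype.matchAll` support (for example a recent Node.js).

```javascript
// Sample input from the question, joined with newlines.
const html = [
  '<h1>aaa</h1>',
  'bbb <img src="bla" /> ccc',
  '<div>ddd</div>',
].join('\n');

// The expression from the question: a run of non-bracket characters that
// starts right after '>' (or the start of input) and ends right before '<'
// (or the end of input).
const outsideTags = /(?<=^|>)[^><]+?(?=<|$)/g;

// Hypothetical helper: collect every match, trim it, drop empty ones,
// and join the rest with single spaces.
function textBetweenTags(input) {
  return Array.from(input.matchAll(outsideTags), (m) => m[0].trim())
    .filter((s) => s.length > 0)
    .join(' ');
}

console.log(textBetweenTags(html)); // aaa bbb ccc ddd
```

The pattern only looks at the characters directly around each run of text, so it cannot tell whether that text sits between an opening and a closing tag. That is why "aaa" and "ddd" show up in the result.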

possible duplicate of [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege) — Brad Mace, Jul 09 '11 at 20:53
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Paŭlo Ebermann, Sep 15 '11 at 14:16

score 6 · Answer 1 · answered Jun 09 '09 at 15:29

Regexes are a clunky and unreliable way to work on markup. I would suggest using a DOM parser such as SimpleHtmlDom:

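// Print the text content of every hyperlink on the specified page.
// You can use selectors, e.g. 'a.pretty' - see the docs.
// find() without an index returns an array of elements, so loop over it.
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext, "\n";
}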

If you want to do that on the client, you can use a library such as jQuery like so:

$('a').each(function() {
    alert($(this).text());
});

score 0 · Answer 2 · answered Jun 09 '09 at 15:32

Look for an appropriate regex that matches complete tags (e.g. in a library like http://regexlib.com/) and remove them using the substitute operator s///. Then use what is left.
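
One way to read this suggestion, sketched in JavaScript with `String.prototype.replace` standing in for `s///`. The pattern and the helper name `textOutsideElements` are assumptions made for illustration, not an expression taken from regexlib.com. The sketch does not handle nested elements of the same name, comments, or a `>` inside an attribute value, so treat it as a starting point for simple input only.

```javascript
// Sample input from the question, joined with newlines.
const html = [
  '<h1>aaa</h1>',
  'bbb <img src="bla" /> ccc',
  '<div>ddd</div>',
].join('\n');

// First alternative: an opening tag, its contents, and the closing tag with
// the same name (enforced by the \1 backreference).
// Second alternative: any leftover standalone tag such as <img ... />.
const elementOrTag =
  /<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>[\s\S]*?<\/\1\s*>|<[a-zA-Z\/][^>]*>/g;

// Hypothetical helper: remove everything the pattern matches, then
// normalise the whitespace in what is left.
function textOutsideElements(input) {
  return input
    .replace(elementOrTag, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(textOutsideElements(html)); // bbb ccc
```

Requiring the closing tag to repeat the opening tag's name keeps a standalone tag like `<img ... />` from swallowing the text up to the next closing tag on the same line.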

Fritz G. Mehner


I tried this one, but the expression is not written correctly: s/(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)//|(?<=^|>)[^><]+?(?=<|$) – Jun 09 '09 at 21:10

score 0 · Answer 3 · 2011-09-27T21:59:09.453

Thanks everybody,

combining both expressions would do the dirty work, but I would like the opposite output.

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

With this sample string:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh

For simplification, I use this tool.

edited Sep 27 '11 at 21:59

answered Jun 09 '09 at 21:07
