1

How to match all content outside of HTML tags?

My pseudo-HTML is:

<h1>aaa</h1>
bbb <img src="bla" /> ccc
<div>ddd</div>

I used this regular expression:

(?<=^|>)[^><]+?(?=<|$)

which gives me: "aaa bbb ccc ddd"

What I need is a way to also ignore the content enclosed by HTML tags, so that the result is: "bbb ccc"

  • possible duplicate of [Can you provide some examples of why it is hard to parse XML and HTML with a regex?](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege) – Brad Mace Jul 09 '11 at 20:53
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Paŭlo Ebermann Sep 15 '11 at 14:16

3 Answers

6

Regexes are a clunky and unreliable way to work on markup. I would suggest using a DOM parser such as SimpleHtmlDom:

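include 'simple_html_dom.php'; // adjust the path to your copy of the library
// Print the text of every hyperlink on the page; find() returns an array of elements.
// You can use selectors, e.g. 'a.pretty' (see the docs).
foreach (file_get_html('http://www.example.org')->find('a') as $link) {
    echo $link->plaintext, "\n";
}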

If you want to do that on the client, you can use a library such as jQuery like so:

$('a').each(function() {
    alert($(this).text());
});
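
For the case in the question, where the wanted text sits directly between elements rather than inside them, a DOM-based approach can pick out only the top-level text nodes. The following is a minimal sketch and not part of the original answer; the helper name `topLevelText` and the container element are assumptions made for illustration.

```javascript
// Hypothetical helper: collect the text that sits directly inside `container`,
// skipping everything nested inside child elements.
function topLevelText(container) {
  const TEXT_NODE = 3; // same value as Node.TEXT_NODE in the DOM
  return Array.from(container.childNodes)
    .filter((node) => node.nodeType === TEXT_NODE)
    .map((node) => node.textContent.trim())
    .filter((text) => text.length > 0)
    .join(' ');
}

// Browser usage (assumes a DOM is available):
//   const container = document.createElement('div');
//   container.innerHTML = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
//   topLevelText(container);
```

Because the function only looks at direct children, text inside `<h1>` or `<div>` is never visited, so no pattern has to guess where an element ends.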
karim79
0

Look for an appropriate regex to match complete tags (e.g. in a library like http://regexlib.com/) and remove them using the substitution operator s///. Then use the rest.
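
The remove-then-use-the-rest idea could be sketched in JavaScript as follows, with `replace` standing in for `s///`. The patterns and the helper name `stripElements` are illustrative assumptions, not taken from regexlib.com. The sketch assumes that elements of the same name are not nested and that attribute values contain no `>`.

```javascript
// Hypothetical helper: remove paired elements (with their content) and any
// leftover standalone tags, then return whatever text remains.
function stripElements(html) {
  return html
    // <name ...> ... </name>; the backreference \1 ties the closing tag to the opening one
    .replace(/<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>[\s\S]*?<\/\1>/g, ' ')
    // remaining standalone tags such as <img ... />
    .replace(/<[a-zA-Z\/][^>]*>/g, ' ')
    // collapse the whitespace left behind by the removals
    .replace(/\s+/g, ' ')
    .trim();
}

const sample = '<h1>aaa</h1>\nbbb <img src="bla" /> ccc\n<div>ddd</div>';
console.log(stripElements(sample)); // prints "bbb ccc"
```

Removing paired elements first matters: if the standalone tags were stripped first, the content of `<h1>` and `<div>` would be left in the result.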

Fritz G. Mehner
  • I used this one, but this expression is typed incorrectly: s/(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)//|(?<=^|>)[^><]+?(?=<|$) –  Jun 09 '09 at 21:10
0

Thanks, everybody.

Combining both expressions would be a dirty solution, but what I would like is the opposite output of this expression:

(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)

With this as the pseudo-string:

<h1>aaa</h1>

bbb <img src="bla" /> ccc

<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>

<div>dsada</div> hbhgjh

To keep things simple, I am testing with this tool.
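
One way to get the "opposite output" is to delete every match of the expression above and keep what is left. The sketch below does that in JavaScript on the pseudo-string from this post; the helper name `remainingText` and the line-by-line whitespace cleanup are assumptions added for illustration.

```javascript
// The expression from the post, used verbatim with the global flag.
const tagPattern = /(\<(.*?)\>)(.*?)(\<\/(.*?)\>)|(<[a-zA-Z\/][^>]*>)/g;

// Hypothetical helper: delete every match, then tidy whitespace line by line.
function remainingText(input) {
  return input
    .replace(tagPattern, '')
    .split('\n')
    .map((line) => line.replace(/\s+/g, ' ').trim())
    .filter((line) => line.length > 0);
}

const pseudo = [
  '<h1>aaa</h1>',
  '',
  'bbb <img src="bla" /> ccc',
  '',
  '<div>ddd</div> jhgvjhgjh zhg zt <div>ddd</div>',
  '',
  '<div>dsada</div> hbhgjh',
].join('\n');

console.log(remainingText(pseudo)); // [ 'bbb ccc', 'jhgvjhgjh zhg zt', 'hbhgjh' ]
```

This appears to depend on the layout of the input. Because `.` does not cross line breaks, a standalone tag such as `<img ... />` is only handled by the second alternative when no closing tag follows it on the same line. If a closing tag does follow, the first alternative can pair the two and remove the text between them.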