The text is:

<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].

Essentially what I'm trying to do is extract all of the words above while abiding by these rules:

  1. A word can contain a dash or an apostrophe (scramble'd and make-shift above)
  2. A word cannot be within a link tag
  3. A word cannot be within a block tag, e.g. [tag]
  4. A word cannot be part of a tag name or HTML markup (class in class=", div, a, tag, etc.)

My initial thought is to first remove the tags whose content I don't need, such as a, together with that content. Even then, however, I am finding it hard to match everything inside the div above without also matching 'div', 'class', or 'left right'.
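The strip-then-match idea described above can be sketched roughly as follows. This is only an illustration in Python (the question does not say which regex engine is in use), the function name `extract_words` is hypothetical, and regex-based tag stripping like this is only reasonable for simple, well-formed input such as the sample text.

```python
import re

# Word pattern: runs of letters/digits, optionally joined by a single
# apostrophe or dash (so "scramble'd" and "make-shift" stay whole).
WORD = r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*"

def extract_words(text):
    # 1. Drop link elements and [tag] blocks together with their content.
    text = re.sub(r"<a\b[^>]*>.*?</a>", " ", text, flags=re.S | re.I)
    text = re.sub(r"\[tag\].*?\[/tag\]", " ", text, flags=re.S | re.I)
    # 2. Drop any remaining HTML tags, keeping the text between them.
    text = re.sub(r"<[^>]*>", " ", text)
    # 3. Whatever is left is plain text, so a simple word match is enough.
    return re.findall(WORD, text)

sample = ('<div class="left right">printing and</div> scramble\'d it to '
          'make-shift type <a href="google.com">specimen book</a> and '
          'something [tag]else[/tag].')
print(extract_words(sample))
```

Removing the unwanted elements first means the final pattern never has to know about tag names or attributes, which is what made the single-regex attempt hard.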

Appreciate any help. I currently have:

\s?[a-zA-Z0-9\'\-]+\s?

Which is shameful, I know.

Mat
Damien Roche
  • As always, this link for those that begin dabbling in Regex + html: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – xanatos Oct 29 '11 at 09:41
  • Second @xanatos' comment. Convert the [block] tags to something XML-ish, then use XML tools to extract the text you want. Maybe then you could even apply your regex. (Lose the `\s` boundaries, though, or change them into assertions, if your regex engine supports that.) – tripleee Oct 29 '11 at 09:59
  • That first link is genius! I guess I'm fighting a losing battle here then. – Damien Roche Oct 29 '11 at 10:06
  • What regex engine are you using? (.NET, Java, PHP/PCRE, etc?) – Jeremy Stein Nov 10 '11 at 18:39
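
One way to read the suggestion in the comments (use a markup parser for the HTML part and keep the regex for plain text only) is sketched below. It is a variation rather than exactly what was proposed: it uses Python's standard `html.parser` instead of XML tools, still handles the `[tag]` blocks with a regex, and the names `TextOutsideLinks` and `words_via_parser` are made up for this example.

```python
import re
from html.parser import HTMLParser

class TextOutsideLinks(HTMLParser):
    """Collects text nodes, skipping anything inside <a>...</a>."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        # Tag names and attributes never reach this method, only text.
        if self.link_depth == 0:
            self.chunks.append(data)

def words_via_parser(markup):
    parser = TextOutsideLinks()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    text = " ".join(parser.chunks)
    # The square-bracket blocks are not HTML, so remove them separately.
    text = re.sub(r"\[tag\].*?\[/tag\]", " ", text, flags=re.S)
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)
```

Joining the chunks with a space is a simplification: a word split by an inline tag (for example `foo<b>bar</b>`) would be returned as two words.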

2 Answers

This should work:

[^<>\[\]]+(?=[<[])

Carol Skelly

This will work with the .NET regex engine, which is one of the few that support unbounded repetition inside a negative look-behind.

(?<!<[^>]*)(?<!<a[^<]*)(?<!\[[^\]]*)(?<!\[tag[^[]*)\w[^\s<[]*
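
For engines without variable-length look-behind, a different technique can give a similar result: match the things to skip first in an alternation and capture only the words. The sketch below is not a translation of the .NET pattern above; it is a swapped-in approach shown in Python, with the hypothetical names `SKIP_OR_WORD` and `words_by_alternation`, and it assumes simple, well-formed markup.

```python
import re

# Alternatives are tried left to right at each position. The first four
# consume markup we want to ignore; only the last one captures a word.
SKIP_OR_WORD = re.compile(
    r"<a\b[^>]*>.*?</a>"        # a link element and its content
    r"|\[tag\].*?\[/tag\]"      # a [tag] block and its content
    r"|<[^>]*>"                 # any other HTML tag
    r"|\[[^\]]*\]"              # any other square-bracket tag
    r"|([A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*)",  # group 1: a word
    re.S | re.I,
)

def words_by_alternation(text):
    # Keep only the matches where the word group took part.
    return [m.group(1) for m in SKIP_OR_WORD.finditer(text) if m.group(1)]
```

Because the skipped constructs are consumed as whole matches, words inside them are never seen by the final alternative, so no look-behind is needed.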
Jeremy Stein