Struggling with regex. Could anybody help me match the words in this text based on these rules?

Question

The text is:

<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].

Essentially what I'm trying to do is extract all of the words above while abiding by these rules:

A word can contain a dash or an apostrophe (scramble'd and make-shift above)
A word cannot be within a link tag
A word cannot be within a block tag - [tag]
A word cannot be part of a tag name or other HTML markup (class in class="...", div, a, tag, etc.)

My initial thought is to remove the tags whose content I don't need (a and the like), together with that content. Even then, however, I'm finding it hard to match everything inside the div above without also matching 'div', 'class' or 'left right'.

Appreciate any help. I currently have:

\s?[a-zA-Z0-9\'\-]+\s?

Which is shameful, I know.
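The strip-first idea described in the question can be sketched in Python roughly as follows. This is only a minimal sketch: the function name `extract_words` and the word pattern are assumptions made for illustration, and regex-based tag stripping like this is only reasonable for simple, well-formed input such as the sample text.

```python
import re

TEXT = (
    '<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> '
    "typesetting industry. Lorem Ipsum has been the industry's standard dummy text "
    "ever since the 1500s, when an unknown printer took a galley of type and "
    "scramble'd it to make-shift type "
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)

def extract_words(text):
    """Hypothetical helper: strip unwanted markup first, then match words."""
    # 1. Drop <a>...</a> links and [tag]...[/tag] blocks together with their content.
    text = re.sub(r"<a\b[^>]*>.*?</a>", " ", text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r"\[tag\].*?\[/tag\]", " ", text, flags=re.DOTALL)
    # 2. Drop any remaining tags but keep the text between them.
    text = re.sub(r"<[^>]*>|\[[^\]]*\]", " ", text)
    # 3. A word is a run of alphanumerics, optionally joined by single
    #    dashes or apostrophes (scramble'd, make-shift).
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)

words = extract_words(TEXT)
print(words)
```

Because the markup is gone before the word pattern runs, the pattern itself never has to know about 'div', 'class' or 'left right'.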

As always, this link for those that begin dabbling in Regex + html: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — xanatos, Oct 29 '11 at 09:41
Second @xanatos' comment. Convert the [block] tags to something XML-ish, then use XML tools to extract the text you want. Maybe then you could even apply your regex. (Lose the `\s` boundaries, though, or change them into assertions, if your regex engine supports that.) — tripleee, Oct 29 '11 at 09:59
That first link is genius! I guess I'm fighting a losing battle here then. — Damien Roche, Oct 29 '11 at 10:06
What regex engine are you using? (.NET, Java, PHP/PCRE, etc?) — Jeremy Stein, Nov 10 '11 at 18:39
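The suggestion in the comments (turn the [block] tags into something XML-ish, then let a markup parser pull out the text) might look like this in Python. It is a sketch under assumptions: the class name `WordCollector`, the `SKIP` set and the word pattern are made up for illustration, and it uses the standard library's `html.parser` rather than a strict XML tool.

```python
import re
from html.parser import HTMLParser

TEXT = (
    '<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> '
    "typesetting industry. Lorem Ipsum has been the industry's standard dummy text "
    "ever since the 1500s, when an unknown printer took a galley of type and "
    "scramble'd it to make-shift type "
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)

class WordCollector(HTMLParser):
    """Hypothetical collector: gathers words outside the skipped elements."""
    SKIP = {"a", "tag"}  # elements whose text content is ignored

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a skipped element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Only text nodes reach this method, so tag and attribute names never do.
        if not self.depth:
            self.words += re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", data)

def extract_words(text):
    # Rewrite [tag] / [/tag] as <tag> / </tag> so the parser treats them as elements.
    text = re.sub(r"\[(/?)(\w+)\]", r"<\1\2>", text)
    parser = WordCollector()
    parser.feed(text)
    parser.close()
    return parser.words

words = extract_words(TEXT)
print(words)
```

Here the regex is applied only to text the parser has already separated from the markup, which is the point of the comment.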

Carol Skelly · Answer 1 · 2011-11-02T19:07:39.093

0

This should work:

[^<>\[\]]+(?=[<[])
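To see what this pattern returns on the sample text, here is a small Python check. It is a sketch, not a statement about any other engine; the only change to the pattern is that the bracket inside the look-ahead class is escaped, which does not alter its meaning.

```python
import re

TEXT = (
    '<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> '
    "typesetting industry. Lorem Ipsum has been the industry's standard dummy text "
    "ever since the 1500s, when an unknown printer took a galley of type and "
    "scramble'd it to make-shift type "
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)

# A run of characters free of < > [ ] that is immediately followed by < or [.
CHUNK = re.compile(r"[^<>\[\]]+(?=[<\[])")

chunks = CHUNK.findall(TEXT)
for chunk in chunks:
    print(repr(chunk))

# Splitting on whitespace gives word-like tokens, punctuation still attached.
tokens = [tok for chunk in chunks for tok in chunk.split()]
```

In this sketch the pattern skips tag names and attributes, but it returns whole runs of text rather than single words, and the runs inside the link ("specimen book") and inside [tag] ("else") are still among the matches. Text after the last tag, here the final ".", is never matched because nothing follows it to satisfy the look-ahead.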

edited Nov 02 '11 at 19:07

answered Oct 29 '11 at 14:08

Carol Skelly


What's up with the down vote? It works fine as demonstrated here: http://gskinner.com/RegExr/ – Carol Skelly Nov 02 '11 at 19:08
+1 This works, except it doesn't break down matches into individual words. – Jeremy Stein Nov 10 '11 at 18:46

score 0 · Accepted Answer · answered Nov 10 '11 at 18:45

This will work with the .NET regex engine, but that's one of the few that support repetition in a negative look-behind.

(?<!<[^>]*)(?<!<a[^<]*)(?<!\[[^\]]*)(?<!\[tag[^[]*)\w[^\s<[]*
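For engines without variable-width look-behind, a different technique can aim at the same rules: match the unwanted markup first and capture only what is left. The Python sketch below first shows that the standard `re` module rejects the pattern above, then uses that skip-or-capture alternation instead. The name `SKIP_OR_WORD` is an assumption for illustration, and the result has not been compared against actual .NET output.

```python
import re

TEXT = (
    '<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> '
    "typesetting industry. Lorem Ipsum has been the industry's standard dummy text "
    "ever since the 1500s, when an unknown printer took a galley of type and "
    "scramble'd it to make-shift type "
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)

DOTNET_PATTERN = r"(?<!<[^>]*)(?<!<a[^<]*)(?<!\[[^\]]*)(?<!\[tag[^[]*)\w[^\s<[]*"

try:
    re.compile(DOTNET_PATTERN)
    compiled_ok = True
except re.error:
    # Python's re module only accepts fixed-width look-behind.
    compiled_ok = False

# Alternation: earlier branches consume markup, only the last branch captures.
SKIP_OR_WORD = re.compile(
    r"<a\b[^>]*>.*?</a>"      # a link together with its content
    r"|\[tag\].*?\[/tag\]"    # a [tag] block together with its content
    r"|<[^>]*>|\[[^\]]*\]"    # any other tag
    r"|(\w[^\s<\[]*)",        # group 1: a word, using the answer's word pattern
    re.DOTALL,
)

words = [m.group(1) for m in SKIP_OR_WORD.finditer(TEXT) if m.group(1)]
print(compiled_ok, words)
```

Since the word branch reuses `\w[^\s<[]*` from the answer, tokens keep trailing punctuation ("industry.", "1500s,"); a stricter word pattern could be swapped into group 1 if that matters.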

Jeremy Stein
