The text is:
<div class="left right">Lorem Ipsum is simply dummy text of the printing and</div> typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scramble'd it to make-shift type <a href="google.com">specimen book</a> and something [tag]else[/tag].
Essentially what I'm trying to do is extract all of the words above while abiding by these rules:
- A word can contain a dash or an apostrophe (scramble'd and make-shift above)
- A word cannot be inside a link tag (<a>)
- A word cannot be inside a block tag such as [tag]
- A word cannot be part of a tag name or other HTML markup (class in class="...", div, a, tag, etc.)
My initial thought is to first remove the tags I don't need, such as a, together with their content. Even then, however, I am finding it hard to match everything between the div tags above without also matching 'div', 'class', or 'left right'.
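To make that strip-then-match idea concrete, here is a rough sketch in Python (the language is my assumption; the patterns should carry over to most regex flavours). The function name `extract_words` is hypothetical, and the sketch assumes the tags are well-formed and not nested. Regex is not a real HTML parser, so this is an illustration rather than a robust solution.

```python
import re

def extract_words(text):
    """Return the words in text, skipping markup, links and [tag] blocks.

    Hypothetical helper; assumes well-formed, non-nested tags.
    """
    # 1. Drop <a ...>...</a> together with everything inside it.
    text = re.sub(r"<a\b[^>]*>.*?</a>", " ", text, flags=re.IGNORECASE | re.DOTALL)
    # 2. Drop [name]...[/name] blocks together with their content.
    #    The backreference \1 requires the closing name to match the opening one.
    text = re.sub(r"\[(\w+)\].*?\[/\1\]", " ", text, flags=re.DOTALL)
    # 3. Drop any remaining HTML tags (names and attributes) but keep their text.
    text = re.sub(r"<[^>]+>", " ", text)
    # 4. A word is a run of letters/digits, optionally joined by single
    #    apostrophes or dashes (scramble'd, make-shift).
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", text)


sample = (
    '<div class="left right">Lorem and</div> scramble\'d it to make-shift type '
    '<a href="google.com">specimen book</a> and something [tag]else[/tag].'
)
print(extract_words(sample))
```

Removing the unwanted blocks first means the final word pattern never sees 'div', 'class', or 'left right', so it can stay simple.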
Appreciate any help. I currently have:
\s?[a-zA-Z0-9\'\-]+\s?
Which is shameful, I know.