I am looking for a way to use JavaScript to split a sentence containing HTML into words while leaving inline HTML elements, together with their text content, intact. Punctuation can be regarded as part of the word it is closest to. I'd like to use regex, probably with something equivalent to PHP's preg_split() (for example String.prototype.split() or match() in JavaScript), to split the sentences. Here is an example:
A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>
Preferably, I would like to end up with the following:
[0] => A
[1] => word,
[2] => <a href='#' title=''>words within tags should remain intact</a>,
[3] => so
[4] => here's
[5] => <b>even more</b>
[6] => <u>words</u>
I know about the discussion on parsing HTML with regex (I enjoyed reading bobince's answer :-P ), but I need to split a sentence into words without breaking HTML tags that have attributes. I don't see how I can do this in JS other than with regex. Of course, if there are alternatives, I'd be more than happy to adopt them to achieve a similar result. :-)
Edit: I searched for similar questions on Stack Overflow, but none of them quite fit my case. To put it into perspective:
- splitting-up-html-code-tags-and-content: aims to split up the inline HTML itself, which is exactly what I want to leave intact.
- php-regex-to-match-outside-of-html-tags: targets all text nodes in an HTML snippet, including the text inside HTML elements. I only want to target the spaces outside of HTML elements, so the spaces within text that is wrapped in HTML tags should be left alone.
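To make the requirement concrete, here is a minimal sketch of the behaviour described above. It assumes flat, well-formed inline elements with a matching closing tag, and it matches tokens with String.prototype.match() rather than splitting on spaces. The function name splitKeepingInlineTags is made up for illustration and does not come from any library.

```javascript
// Hypothetical helper for illustration only.
// Assumes simple inline elements such as <a ...>...</a>, <b>...</b>, <u>...</u>.
function splitKeepingInlineTags(sentence) {
  // Alternative 1: a whole element, from its opening tag to the closing tag
  //                with the same name (backreference \1), plus any punctuation
  //                attached directly after it.
  // Alternative 2: a run of characters that are neither whitespace nor "<".
  const token = /<([a-z][a-z0-9]*)\b[^>]*>[\s\S]*?<\/\1>[^\s<]*|[^\s<]+/gi;
  // match() with the g flag returns all matches, or null when there are none.
  return sentence.match(token) || [];
}

const input =
  "A word, <a href='#' title=''>words within tags should remain intact</a>, so here's\n" +
  "<b>even more</b> <u>words</u>";

console.log(splitKeepingInlineTags(input));
```

With the example sentence this yields the seven items listed above. It is not a general HTML parser: an element nested inside another element of the same name, a void element such as <br>, or a stray "<" in the text will not be handled correctly, so parsing the fragment into DOM nodes is likely the safer route for arbitrary markup.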