Split a sentence with HTML into words (but leave inline HTML intact)

Question

I am looking for a way to use JavaScript to split a sentence containing HTML into words, while leaving inline HTML tags and their text content intact. Punctuation can be regarded as part of the word it is closest to. I'd like to use regex, and probably preg_split() for splitting the sentences. Here follows an example:

A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
<b>even more</b> <u>words</u>

Preferably, I would like to end up with the following:

[0] => A
[1] => word,
[2] => <a href='#' title=''>words within tags should remain intact</a>,
[3] => so
[4] => here's
[5] => <b>even more</b>
[6] => <u>words</u>

I know about the discussion on parsing HTML with regex (I enjoyed reading bobince's answer :-P ), but I need to split the words of a sentence without harming HTML tags with attributes. I don't see how I can do this in JS in any other way than with regex. Of course, if there are alternatives, I'd be more than happy to adopt them to achieve a similar result. :-)

Edit: I searched for similar questions on Stack Overflow, but they don't tick the boxes for me. To put it a little into perspective:

splitting-up-html-code-tags-and-content: aims to split up the inline HTML, which is what I want to leave intact.
php-regex-to-match-outside-of-html-tags: targets all text nodes in an HTML snippet, even those inside HTML elements. But as a matter of fact, I want to target only the spaces outside of HTML elements (so even excluding the spaces within the text nodes that are wrapped in HTML tags).

preg_split is not JavaScript. Preg_split() is PHP. In your first sentence you say you want Javascript. What have you tried yourself BTW? Please post your own trial and error and people will help. — vrijdenker, Nov 09 '14 at 17:50
Thanks for the comment and apologies. I probably confused languages here.. Using them both, but I am targeting JS here. I'm only getting started in understanding regex, that is why I wanted some help. I searched for some other questions around here, but most of them seem to split the words inside the HTML tags as well. I'll update my post with what I found, but does not tick the boxes. — Jeroen, Nov 09 '14 at 17:58
Small update: I'm currently looking into using `childNode`, `childValue`. This seems to be much easier and more logical to use. To be continued! — Jeroen, Nov 09 '14 at 19:40

score 3 · Answer 1 · answered Nov 09 '14 at 20:06

This is possible, but there are some drawbacks to using a pure regex solution. The easiest to call out is nested HTML. The solution I'm about to show uses a back reference to try to get around this, but if you get some complicated nested HTML it will probably start failing in weird ways.

/(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g

Regex Demo

The regex uses a back reference (\1, which must match the name of the opening tag) and a negative lookahead ((?!<)) to get the work done. You could potentially remove the back reference depending on your requirements. The back reference helps with supporting nested tags.

JSFiddle example - check your console output for the result.

Here's the output from JSFiddle (I formatted it a bit):

[
  "A", 
  "word,", 
  "<a href='#' title=''>words within tags should remain intact</a>,", 
  "so", 
  "here's", 
  "<b>even more</b>", 
  "<u>words</u>"
]
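
For readers who cannot open the demo, here is a minimal sketch of how the regex above could be applied with `String.prototype.match`. The helper name `splitWithRegex` and the variable names are assumptions made for illustration; only the regex itself comes from this answer.

```javascript
// The regex from this answer, unchanged.
var rWordOrElement = /(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g;

// Hypothetical helper: returns every match, or [] when nothing matches
// (match() with a global regex returns null in that case).
function splitWithRegex(html) {
  return html.match(rWordOrElement) || [];
}

// The sentence from the question, including its line break.
var sentence = "A word, <a href='#' title=''>words within tags should remain intact</a>, so here's\n" +
  "<b>even more</b> <u>words</u>";

console.log(splitWithRegex(sentence));
```

Because the regex carries the `g` flag, `match` returns all matches rather than just the first one, which is what produces one array entry per word or element.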

Depending on your use case, you'll need to modify it to work for you. I considered a word to be anything that isn't whitespace, but you may have different criteria.

One drawback of this method is that if an opening HTML tag comes directly at the end of a word, it won't be picked up properly, i.e. testsomething else.
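
The drawback just described can be reproduced with a made-up input. The string below is an assumption chosen for illustration, not the answer's own example; the regex is the one shown earlier in this answer.

```javascript
// The regex from this answer, unchanged.
var rWordOrElement = /(?:<(\w+)[^>]*>(?:[\w+]+(?:(?!<).*?)<\/\1>?)[^\s\w]?|[^\s]+)/g;

// Hypothetical input: the opening <b> tag is glued to the end of "test".
var glued = 'test<b>some thing</b> else';

// The match starts at "t", so only the plain "non-whitespace run" branch
// applies and the element is cut at the space inside it.
var pieces = glued.match(rWordOrElement);
console.log(pieces); // ["test<b>some", "thing</b>", "else"]
```

The element branch of the regex only applies when a match starts at `<`, so a tag that begins in the middle of a non-whitespace run is treated as ordinary word characters.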

Nathan

This answer is the answer for the question. I discovered that Regex is indeed rather difficult to use when it comes to splitting HTML sentences. In my case, there won't be nested HTML, but I do tend to look into `childNodes`, `childValue` right now. – Jeroen Nov 09 '14 at 20:49
This fails if there are comments in the HTML. Perhaps you should strip out comments entirely first? – soktinpk Nov 09 '14 at 21:08
This answer proves that it is possible with Regex to do the job. This answer also states that Regex might not be suitable for this purpose. This answer gives the reasons why to use the other answer. I think that when people find this question, the other answer might be more of an appropriate solution. +1 for the answer. I do want to approve the other one as the answer to go for. Sorry, but thank you very much for putting the use of Regex for this into perspective! :) – Jeroen Nov 09 '14 at 21:08
@Jeroen: Have to agree with you, the other answer from soktinpk would be the better solution. Regexes just aren't well suited for this type of problem. – Nathan Nov 09 '14 at 21:14
brilliant answer – AlainIb Mar 22 '18 at 22:29

soktinpk · Accepted Answer · 2014-11-09T21:07:27.340

You can use the following snippet:

function splitIntoWords(div) {
  function removeEmptyStrings(k) {
    return k !== '';
  }
  var rWordBoundary = /\s+/; // \s already matches spaces, newlines and tabs
  var output = [];
  for (var i = 0; i < div.childNodes.length; ++i) { // Iterate through all nodes
    var node = div.childNodes[i];
    if (node.nodeType === Node.TEXT_NODE) { // The child is a text node
      var words = node.nodeValue.split(rWordBoundary).filter(removeEmptyStrings);
      if (words.length) {
        output.push.apply(output, words);
      }
    } else if (node.nodeType === Node.COMMENT_NODE) {
      // What to do here? You can do what you want
    } else {
      output.push(node.outerHTML);
    }
  }
  return output;
}

window.onload = function() {
  var div = document.querySelector("div");
  document.querySelector("pre").innerText = 'Output: ' + JSON.stringify(splitIntoWords(div));
}

<!-- Note you have to surround your html with a div element -->
<div>A word, <a href='#' title=''>words within tags should remain intact</a>, so here's
  <b>even more</b>  <u>words</u>
</div>
<pre></pre>

It iterates through all child nodes, takes the text nodes and splits them into words (you can do this safely since text nodes can't contain children).

This takes care of most issues. With this, HTML such as textTest will come out ["text", "Test"] unlike the answer above.

This may fail with There are: 4 words which results in ["There are", ":" /* Extra colon */, "4", "words"] (which it's supposed to do, but not sure if it is desirable).

I would think this is very safe with nested elements.
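
To try the same idea outside a browser, here is a rough sketch of a variant that works on any object exposing `childNodes`, `nodeType`, `nodeValue` and `outerHTML`. The function name `splitChildNodesIntoWords`, the choice to skip comments, and the `fakeDiv` stand-in are all assumptions for illustration; in a page you would pass a real element, as in the snippet above.

```javascript
// Numeric nodeType values from the DOM standard, used here instead of the
// browser-only `Node` global so the function also runs outside a browser.
var TEXT_NODE = 3;
var COMMENT_NODE = 8;

// Hypothetical variant of splitIntoWords: comment nodes are skipped.
function splitChildNodesIntoWords(parent) {
  var output = [];
  Array.prototype.forEach.call(parent.childNodes, function (node) {
    if (node.nodeType === TEXT_NODE) {
      // Text nodes have no children, so splitting on whitespace is safe.
      node.nodeValue.split(/\s+/).forEach(function (word) {
        if (word !== '') output.push(word);
      });
    } else if (node.nodeType !== COMMENT_NODE) {
      // Anything else is kept whole, including its tags and attributes.
      output.push(node.outerHTML);
    }
  });
  return output;
}

// Stand-in for a parsed <div> (an assumption made for this sketch).
var fakeDiv = {
  childNodes: [
    { nodeType: 3, nodeValue: 'A word, ' },
    { nodeType: 1, outerHTML: '<a href="#" title="">words within tags should remain intact</a>' },
    { nodeType: 3, nodeValue: ", so here's\n  " },
    { nodeType: 8, nodeValue: ' a comment ' },
    { nodeType: 1, outerHTML: '<b>even more</b>' },
    { nodeType: 3, nodeValue: '  ' },
    { nodeType: 1, outerHTML: '<u>words</u>' }
  ]
};

console.log(splitChildNodesIntoWords(fakeDiv));
```

Note that the comma following the link lives in the next text node, so it comes out as its own entry (`","`) rather than attached to the element. If punctuation should stick to the preceding element, it would have to be merged in a separate step.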

As opposed to what I wrote earlier, this is the answer to go for. However, the earlier submitted answer does put the use of this into perspective. That's why I changed the accepted answer to this one, to provide it as an alternative to using regex (which can be very tricky, as Nathan suggested). However, I want to thank Nathan for the effort of explaining! – Jeroen, Nov 09 '14 at 21:11
