Tokenize HTML string in JavaScript

Question

I would like to split a string that looks like:

This is <strong>a</strong> test <a href="#test">link</a> and <br /> line. break

into the following with JavaScript:

[
    'This',
    'is',
    '<strong>a</strong>',
    'test',
    '<a href="#test">link</a>',
    '<br />',
    'line.',
]

I tried splitting on spaces and on < and >, but that obviously doesn't work for tags like strong and a. I'm not sure how to write a regex that doesn't split inside HTML tags. I also tried jQuery's children(), but it returns only the HTML elements, not the plain text. Any help would be great.

This is more complicated than it seems. Parsing HTML is hard. — elclanrs, Dec 27 '17 at 22:22
I can't imagine a reason why you would be doing this but if it's for some sort of user-input (comments, forum posts, etc) it would be easier(and safer) to create your own flavor of markdown than delve into the realm of tokenizing HTML. — zfrisch, Dec 27 '17 at 23:04

score 1 · Answer 1 · answered Dec 27 '17 at 23:46

If the code is executing in a browser, using the browser's parser to separate the string into text and tag components may provide an alternative workaround:

var text = 'This is <strong>a</strong> <a href="#test">link</a> and <br /> line. break'

function splitHTML( text) {
    var parts = [];
    var div = document.createElement('DIV');
    div.innerHTML = text;
    div.normalize();
    for( var node = div.firstChild; node; node=node.nextSibling) {
         if( node.nodeType == Node.TEXT_NODE) {
             parts.push.apply( parts, node.textContent.split(" "));
         }
         else if( node.nodeType == Node.ELEMENT_NODE) {
             parts.push( node.outerHTML);
         }
    }
    return parts;
}
console.log( splitHTML( text));

Note the line that adds text nodes split by spaces to the result

 parts.push.apply( parts, node.textContent.split(" "));

is for demonstration and needs further work to prevent zero-length strings in the output for spaces between text and HTML elements. Also, the HTML tags are reconstructed from the DOM elements and may not exactly match the input: in this case the XHTML-style <br /> tag is returned as the HTML tag <br> (which doesn't take a closing tag).
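One possible way to avoid the zero-length strings is sketched below. The helper name splitWords is hypothetical (it is not part of the answer's code), and it assumes that any run of whitespace, not only a single space, should count as a separator:

```javascript
// Hypothetical helper: split the text of a text node into words
// without producing empty strings.
function splitWords(text) {
    // trim() drops leading/trailing whitespace, which is what
    // creates the empty entries when splitting on " "
    var trimmed = text.trim();
    // a whitespace-only node contributes no words at all
    if (trimmed === '') {
        return [];
    }
    // /\s+/ treats a run of spaces, tabs or newlines as one separator
    return trimmed.split(/\s+/);
}

console.log(splitWords(' line. break')); // → ['line.', 'break']
console.log(splitWords(' '));            // → []
```

In splitHTML, the demonstration line could then become parts.push.apply( parts, splitWords( node.textContent)); the rest of the function stays the same.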

The general idea is to sidestep parsing HTML with a regex by letting the browser parse it. Understandably, this may or may not fit the target environment and the full set of requirements.

score 0 · Answer 2 · answered Dec 27 '17 at 22:55

To achieve what you want, you need to consider this:

Rule 1) if no "<" occurred yet, simply split at " ".

Rule 2) if "<" occurred, look for "/>" (a self-closing tag) or "</" ... ">" (a closing tag) and split after it, then start at rule 1 again.

Apply those rules while looping through a string and you are golden.
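The two rules could be sketched roughly as follows. This is only an illustration under several assumptions: the function name tokenize is made up, only the space character separates words, attribute values never contain ">", and an element never contains another element with the same tag name.

```javascript
// Rough sketch of Rule 1 and Rule 2 for flat (non-nested) markup.
function tokenize(input) {
    var tokens = [];
    var i = 0;
    while (i < input.length) {
        if (input[i] === ' ') {
            i++;                              // skip separators
        } else if (input[i] === '<') {        // Rule 2: inside a tag
            var openEnd = input.indexOf('>', i);
            if (openEnd === -1) {             // unterminated tag: keep the rest as one token
                tokens.push(input.slice(i));
                break;
            }
            var end = openEnd + 1;
            if (input[openEnd - 1] !== '/') { // not self-closing ("/>")
                // tag name is the text between "<" and the first whitespace
                var name = input.slice(i + 1, openEnd).split(/\s/)[0];
                var close = '</' + name + '>';
                var closeAt = input.indexOf(close, openEnd);
                // if no closing tag exists (e.g. <br>), the tag stands alone
                if (closeAt !== -1) {
                    end = closeAt + close.length;
                }
            }
            tokens.push(input.slice(i, end));
            i = end;
        } else {                              // Rule 1: plain word
            var stop = i;
            while (stop < input.length && input[stop] !== ' ' && input[stop] !== '<') {
                stop++;
            }
            tokens.push(input.slice(i, stop));
            i = stop;
        }
    }
    return tokens;
}

console.log(tokenize('This is <strong>a</strong> test <br /> line.'));
// → ['This', 'is', '<strong>a</strong>', 'test', '<br />', 'line.']
```

Because the closing tag is found with a plain indexOf, same-name nesting such as <div><div>x</div></div> is cut at the first </div>, which is exactly the limitation the rules have.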

Making this recursive, i.e. handling nested tags like

<div>
    <p>Hi</p>
    <p>Bye</p>
</div>

is harder. As mentioned above, actually parsing an HTML tree is very complex.

score 0 · Answer 3 · Aboalnaga · 2017-12-27T23:24:42.133

Try this:

/(?:(?!<)[^<>]+(?!>))|(?:<(?=[^/>]+\/>).*\/>)|(?:<([^\s]+).*>.*(?=<\/\1>)<\/\1>)/g

It should work in simple cases, which are all I can think of right now. Use the captured group to find the tag name, then apply the regex recursively to block elements such as div.

edited Dec 27 '17 at 23:24

answered Dec 27 '17 at 23:19
