-4

I would like to split a string that looks like:

This is <strong>a</strong> test <a href="#test">link</a> and <br /> line. break

into the following with JavaScript:

[
    'This',
    'is',
    '<strong>a</strong>',
    'test',
    '<a href="#test">link</a>',
    '<br />',
    'line.',
]

I tried splitting on spaces, and < >, but that obviously doesn't work for tags like strong and a. I'm not sure how to write a regex that doesn't split within HTML tags. I also tried to use jQuery children(), but it doesn't extract plain text, just the html tags. Any help would be great.

Chris

3 Answers

1

If the code is executing in a browser, using the browser's parser to separate the string into text and tag components may provide an alternative workaround:

var text = 'This is <strong>a</strong> <a href="#test">link</a> and <br /> line. break';

function splitHTML( text) {
    var parts = [];
    // Let the browser parse the markup inside a detached element
    var div = document.createElement('DIV');
    div.innerHTML = text;
    // Merge adjacent text nodes so each run of text is visited once
    div.normalize();
    // Walk the top-level nodes only; nested elements stay inside outerHTML
    for( var node = div.firstChild; node; node=node.nextSibling) {
         if( node.nodeType === Node.TEXT_NODE) {
             parts.push.apply( parts, node.textContent.split(" "));
         }
         else if( node.nodeType === Node.ELEMENT_NODE) {
             parts.push( node.outerHTML);
         }
    }
    return parts;
}
console.log( splitHTML( text));

Note that the line that adds the space-separated pieces of each text node to the result

 parts.push.apply( parts, node.textContent.split(" "));

is for demonstration only and needs further work to prevent zero-length strings in the output where spaces sit between text and HTML elements. Also, the HTML tags are reconstructed from the DOM elements and may not exactly match the input: in this case the XHTML-style <br /> tag is returned as the HTML tag <br> (which doesn't take a closing tag).

The general idea is to sidestep parsing HTML with a regex by letting the browser parse it. Understandably, this may or may not fit the target environment and the full set of requirements.
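One possible way to handle the zero-length strings is sketched below. It is not part of the original answer: `splitWords` and `splitHTMLClean` are hypothetical names, and the `doc` parameter is an assumption added so the function can be given the browser's `document` explicitly. Text nodes are split on any run of whitespace and the empty pieces are filtered out.

```javascript
// Split a run of text on any whitespace and drop the empty strings that
// a plain split(" ") leaves behind for leading or trailing spaces.
function splitWords(text) {
    return text.split(/\s+/).filter(Boolean);
}

// `doc` is a hypothetical parameter: pass the browser's `document`.
function splitHTMLClean(html, doc) {
    var parts = [];
    var div = doc.createElement('div');
    div.innerHTML = html;
    div.normalize();
    for (var node = div.firstChild; node; node = node.nextSibling) {
        if (node.nodeType === 3) {            // 3 === Node.TEXT_NODE
            parts.push.apply(parts, splitWords(node.textContent));
        } else if (node.nodeType === 1) {     // 1 === Node.ELEMENT_NODE
            parts.push(node.outerHTML);
        }
    }
    return parts;
}

// Only runs where a DOM is available (for example, in a browser).
if (typeof document !== 'undefined') {
    console.log(splitHTMLClean('This is <strong>a</strong> and <br /> line.', document));
}
```

As with the original function, only top-level nodes are visited, and element markup still comes back as the browser serializes it rather than exactly as typed.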

traktor
0

To achieve what you want, you need to consider this:

Rule 1) if no "<" occurred yet, simply split at " ".

Rule 2) if "<" occurred, look for "/>" (a self-closing tag) or "</" ... ">" (the closing tag) and split after it, then start at rule 1 again.

Apply those rules while looping through a string and you are golden.
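The two rules can be sketched as a single loop over the string. This is only an illustration under stated assumptions: `splitByRules` is a hypothetical name, tags are not nested inside a tag of the same name, and attribute values never contain ">".

```javascript
function splitByRules(input) {
    var parts = [];
    var i = 0;
    while (i < input.length) {
        if (input[i] === ' ') {          // separators are skipped, not stored
            i++;
            continue;
        }
        if (input[i] === '<') {
            // Rule 2: find the end of the opening tag first.
            var openEnd = input.indexOf('>', i);
            if (openEnd === -1) {        // unterminated tag: keep the rest as is
                parts.push(input.slice(i));
                break;
            }
            var end = openEnd + 1;
            if (input[openEnd - 1] !== '/') {
                // Not self-closing: extend up to the matching "</name>" if present.
                var name = input.slice(i + 1, openEnd).split(/[\s/]/)[0];
                var closeTag = '</' + name + '>';
                var closeAt = input.indexOf(closeTag, end);
                if (closeAt !== -1) {
                    end = closeAt + closeTag.length;
                }
            }
            parts.push(input.slice(i, end));
            i = end;
        } else {
            // Rule 1: plain text runs until the next space or "<".
            var j = i;
            while (j < input.length && input[j] !== ' ' && input[j] !== '<') {
                j++;
            }
            parts.push(input.slice(i, j));
            i = j;
        }
    }
    return parts;
}

console.log(splitByRules('This is <strong>a</strong> test <a href="#test">link</a> and <br /> line. break'));
```

Because the closing tag is located with a plain `indexOf`, markup such as a `<div>` containing another `<div>` would be cut at the first inner `</div>`.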

Making this recursive, i.e. handling nested tags like

<div>
    <p>Hi</p>
    <p>Bye</p>
</div>

is harder. As mentioned above, actually parsing an HTML tree is very complex.

Jan Mund
0

Try this:

/(?:(?!<)[^<>]+(?!>))|(?:<(?=[^/>]+\/>).*\/>)|(?:<([^\s]+).*>.*(?=<\/\1>)<\/\1>)/g

It should work in simple cases, which are all I can think of right now. Use the captured group to find the tag name, then apply the pattern recursively to the contents of block elements such as div.
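For comparison, here is a shorter pattern used with `String.prototype.match`. It is a simplified stand-in, not the expression above: `TOKEN` and `splitWithRegex` are hypothetical names, and it assumes flat markup with no ">" inside attribute values. It tries three alternatives in order: an element with a matching closing tag, any single tag, then a run of non-space text.

```javascript
// 1) <name ...> ... </name>   (the tag name is captured and reused via \1)
// 2) any other single tag, e.g. <br />
// 3) a word: characters that are neither whitespace nor "<"
var TOKEN = /<([a-z][\w-]*)\b[^>]*>.*?<\/\1>|<[^>]+>|[^\s<]+/gi;

function splitWithRegex(html) {
    // With the g flag, match() returns every full match, or null if none.
    return html.match(TOKEN) || [];
}

console.log(splitWithRegex('This is <strong>a</strong> test <a href="#test">link</a> and <br /> line. break'));
```

The lazy `.*?` stops at the first closing tag with the captured name, so an element nested inside another element of the same name would still be split incorrectly.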

Aboalnaga