Javascript split body of html with REGEX

Question

I'm looking to split a body of HTML into an array.

Here is an example of what the code looks like:

<p><h2 class="title">Title 1</h2></p>
<p>Section 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>
<p>velit saepe ducimus aspernatur, quam quaerat autem. Consectetur, vitae.</p>
<p><h2 class="title">Title 2</h2></p>
<p>Section 2: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>
<p><h2 class="title">Title 3</h2></p>
<p>Section 3: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>

Basically I'd like to split the sections up using a positive lookbehind with the following pattern <p><h2 class="title">*</h2></p> or any other type of regex pattern.

Essentially, I'm looking to have an array that contains something like this:

<p><h2 class="title">Title 1</h2></p>
<p>Section 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>
<p>velit saepe ducimus aspernatur, quam quaerat autem. Consectetur, vitae.</p>

<p><h2 class="title">Title 2</h2></p>
<p>Section 2: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>

<p><h2 class="title">Title 3</h2></p>
<p>Section 3: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>

This is the code that will always be constant: <p><h2 class="title">*</h2></p>. The content will always be encapsulated within <p> tags.

Here is an example of the script I'm parsing the data through...

$(contentArr).each(function(ele, idx){

        var content      = ele, contentTrun;
        var contentRegex = /(<p>.*<\/p>)/im;
        var matchContent = contentRegex.exec(content);

        //parse block to get it ready for styling and effect
        var contentRegex    = /((?!<p><h2 class="title".*?\n)<p>.*<\/p>)/igm;
        var parsedContent   = content.replace(contentRegex, "$1");

        //insert parsed content into html block
        $("pressBlocks").insert("<div class=\"blockContentOutter\">\
                                    <span class=\"accordionText\">... <a class=\"readMore\">Read More</a></span>\
                                        <div class=\"blockContent\">"+parsedContent+"</div>\
                                </div>");
    });

you don't need any look arounds, just parenthesize the pattern you have and collect elements by twos. — dandavis, Jun 25 '15 at 21:23
@Sirko: sometimes the DOM produces different output than input, especially with XHTML ( ex: `` vs ``, quoted attribs, etc), so regexp is the only cross-platform way of getting exact output. i know the dom is easier for most uses, but if you need to save the output (not just display it), client-side DOM serialization is not a reliable method. — dandavis, Jun 25 '15 at 21:32
I have to agree with @Sirko--what makes you think regex is the proper tool for this? DOM traversal is likely going to be more simple and more accurate. Unless you have some strange use case that is not stated in the question, I think you have created an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). — Dave, Jun 25 '15 at 21:48
I am in doubt whether to post [this](http://jsfiddle.net/7k27ro63/). I think XPath or DOM would be more correct here. — Wiktor Stribiżew, Jun 25 '15 at 22:38
Obligatory link to [bobince's](http://stackoverflow.com/users/18936) post: [You can't parse XHTML with regex.](http://stackoverflow.com/a/1732454/1509264) — MT0, Jun 25 '15 at 23:03
The code I provided is not the best but its an example of what gets returned from a JSON - there is way more data but that is just a snippet. The end goal is to parse it into div's. — D.Rivera, Jun 25 '15 at 23:12
Thanks for the responses... I've added additional code up top to give you a better idea of what I'm trying to do. — D.Rivera, Jun 25 '15 at 23:19
Using a `h2` element inside a `p` element is syntactically incorrect HTML. If you try doing it you will find that the `h2` element implicitly closes the `p` element and they are not nested. http://stackoverflow.com/a/4676018/1509264 — MT0, Jun 25 '15 at 23:40
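
For reference, dandavis's suggestion in the first comment (parenthesize the pattern and collect the pieces by twos) could look roughly like the sketch below. The variable names and the shortened sample text are assumptions made for illustration, and it only holds if every title really sits on one line in exactly the form shown in the question.

```javascript
// Sample input in the shape shown in the question (shortened, hypothetical data).
var inputHTML =
    '<p><h2 class="title">Title 1</h2></p>\n' +
    '<p>Section 1: Lorem ipsum.</p>\n' +
    '<p>velit saepe ducimus.</p>\n' +
    '<p><h2 class="title">Title 2</h2></p>\n' +
    '<p>Section 2: Lorem ipsum.</p>\n';

// The capture group makes split() keep each heading in the result, so the
// array alternates: [text before first heading, heading, body, heading, body, ...].
var headingPattern = /(<p><h2 class="title">.*?<\/h2><\/p>)/;
var parts = inputHTML.split(headingPattern);

// Collect by twos: each heading (odd index) plus the body that follows it.
var sections = [];
for (var i = 1; i < parts.length; i += 2) {
    sections.push(parts[i] + (parts[i + 1] || ''));
}
```

Anything that appears before the first title ends up in parts[0] and is dropped by the loop, which is fine for the input above but worth knowing if the real data has leading content.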

webdev-dan · Answer 1 · 2015-06-26T00:58:46.537

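Well, if you really need a splitter and you know that the input format remains unchanged, just split it with something like this:

var splitter = "<p><h2 class=\"title\">";
// split() consumes the delimiter, so output[0] is whatever precedes the first
// title (an empty string when the input starts with a title).
var output = inputHTML.split(splitter);
// Put the delimiter back in front of every chunk after the first one.
for (var i = 1; i < output.length; i++) {
    output[i] = splitter + output[i];
}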
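
Since the question asked for a regex, here is a minimal sketch of the same split done with a lookahead, which leaves the delimiter attached to each section so nothing has to be glued back on. It assumes the title opener is always exactly `<p><h2 class="title">`, as the question states; the sample string is shortened, made-up data.

```javascript
// Shortened sample in the shape shown in the question (hypothetical data).
var inputHTML =
    '<p><h2 class="title">Title 1</h2></p>\n' +
    '<p>Section 1: Lorem ipsum.</p>\n' +
    '<p><h2 class="title">Title 2</h2></p>\n' +
    '<p>Section 2: Lorem ipsum.</p>\n';

// Zero-width lookahead: split just before each title opener without
// consuming it, so every resulting chunk still starts with the opener.
var sections = inputHTML.split(/(?=<p><h2 class="title">)/);

// Print each section to show where the boundaries fall.
sections.forEach(function (section, index) {
    console.log('--- section ' + (index + 1) + ' ---\n' + section);
});
```

Lookahead works in every JavaScript engine, unlike lookbehind, which older engines do not support.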

But really, there are nicer ways to do it :)

E.g. with jQuery:

var output = [];
var $input = $('<div/>').append(inputHTML);
$input.children().each( function(){
    var $this = $(this);
    if($this.find('h2.title').length || output.length==0){
        output.push( $('<div/>').append($this) );
    } else {
        output[output.length - 1].append($this);
    }
});

This will give you your paragraphs split into divs, ready in the 'output' array, to do whatever you need with them.

I've just noticed that @MT0 is absolutely right: it is not correct to wrap an h2 element inside a paragraph. So my code will work, but only if you nest your inputHTML correctly, with divs, sections or other block elements instead of paragraphs:

<div><h2 class="title">Title 1</h2></div>
<div>Section 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</div>
<div>velit saepe ducimus aspernatur, quam quaerat autem. Consectetur, vitae.</div>
<div><h2 class="title">Title 2</h2></div>
<div>Section 2: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</div>
<div><h2 class="title">Title 3</h2></div>
<div>Section 3: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</div>

score 0 · Answer 2 · answered Jun 26 '15 at 00:23

As I noted in the comments, this is invalid syntax for HTML:

<p><h2>...</h2></p>

The h2 tag will implicitly close the p tag and they will not be nested (and you will have an empty paragraph before the first heading).

You can solve your problem without regular expressions (although you will need to fix the HTML you are inputting):

var contentArr = [
    "<h2 class=\"title\">Title 1</h2>\
<p>Section 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>\
<p>velit saepe ducimus aspernatur, quam quaerat autem. Consectetur, vitae.</p>\
<h2 class=\"title\">Title 2</h2>\
<p>Section 2: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>\
<h2 class=\"title\">Title 3</h2>\
<p>Section 3: Lorem ipsum dolor sit amet, consectetur adipisicing elit.</p>"
];

$(contentArr).each( function( index, element ){
    $( element ).each( function( i, e ){
        if ( !$( e ).is( "h2" ) )
            return;
        $( '<div class="blockContentOuter" />' )
            .append( '<span class="accordionText">... <a class="readMore">Read More</a></span>' )
            .append( $( '<div class="blockContent" />')
                .append( $(e).nextUntil( "h2" ) ) )
            .appendTo( '#pressBlocks' );
    });
});

.blockContentOuter {
    background-color: lightgrey;
    border: 1px solid darkgrey;
    margin-top: 0.5em;
}

.blockContent {
    background-color: white;
    border: 1px solid darkgrey;
}

<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="pressBlocks"></div>
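
If the input cannot be fixed at its source, one way to get it into the shape used above is to strip the `<p>` wrapper from each heading before handing the string to jQuery. This is only a sketch: `unwrapHeadings` is a hypothetical helper name, and it relies on the question's statement that the wrapper is always exactly `<p>` around a single `h2`. It is not a general HTML parser.

```javascript
// Hypothetical helper: unwrap <h2> headings from the invalid <p>...</p> wrapper.
// Assumes the wrapper is a bare <p> with nothing but one <h2> inside it.
function unwrapHeadings(html) {
    return html.replace(/<p>\s*(<h2\b[^>]*>[\s\S]*?<\/h2>)\s*<\/p>/gi, '$1');
}

// Shortened sample in the shape shown in the question (hypothetical data).
var raw =
    '<p><h2 class="title">Title 1</h2></p>\n' +
    '<p>Section 1: Lorem ipsum.</p>';

var fixed = unwrapHeadings(raw);
console.log(fixed);
```

Ordinary paragraphs are left untouched because the pattern only matches a `<p>` whose content begins with an `<h2>`.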
