Split string into associative array

Question

Example string (html content):

some content
<h2>title 1</h2>
<p>more content</p>
<h2>title 2</h2>
rest of the content

I need to split this into an associative array at the <h2></h2> tags while keeping all the contents of the string.

Desired outputs:

array(){
  'text1' => 'some content',
  'title1' => 'title 1',
  'text2' => '<p>more content</p>',
  'title2' => 'title 2',
  'text3' => 'rest of the content'
}

or

array(){
  [0] => {
    'text' => 'some content',
    'title' => 'title 1'
  },
  [1] => {
    'text' => '<p>more content</p>',
    'title' => 'title 2'
  },
  [2] => {
    'text' => 'rest of the content'
  }
}

What I tried

preg_split() with PREG_SPLIT_DELIM_CAPTURE almost does the job, but it outputs an indexed array.
I also tried the regex below, but it fails to capture text3:
(.*?)(<h2.*?<\/h2>)

Any help or ideas would be much appreciated.

Use this `(?s)(?:<h2>(.*?)</h2>|\s*(.+?)\s*(?=<h2>.*?</h2>|$))` forget that duplicate junk. Parsing html with a DOM will fail if the html is junked up. Use something that works. Or, you could try to find a DOM parser that can go past malformed html (and you can't). — sln, Feb 22 '16 at 15:56
@sln Thank you very much! This helped a lot. If you want to post your comment as an answer, I'll be glad to accept it. On the other note, I don't get all the fuss about DOM parsers and stuff, all I need is to split the bloody string, it doesn't matter if it's html or not. — Matija Mrkaic, Feb 22 '16 at 16:12
@MatijaMrkaic - Questions marked as duplicates cannot be answered. And in fact, your question is going into the boneyard never to be seen again. You can petition the admin if you'd like. — sln, Feb 22 '16 at 16:14
@sln Usually a DOM parser in HTML mode *can* parse incomplete and even malformed HTML. That's the reason why not every second webpage fails to display with "Malformed markup". PHP's DOM parser which is based on libxml2 makes a good job there. A regex can't be used for this kind of tasks, at least it will not work reliably and is hard to maintain. The level of unreliability and un-maintenancy will increase with the complexity of the section to be parsed. — hek2mgl, Feb 22 '16 at 16:36
@hek2mgl - If a DOM parser can go past malformed sgml then what makes it different than a Regular Expression Parser. Seriously, consider this simple looking thing `(?s)<[\w:]+(?:".*?"|'.*?'|[^>]*?)+>`. It is probably the most complex regex ever made that parses an html tag malformed or not. This is engine power. — sln, Feb 22 '16 at 16:45
@sln Sure, a parser internally (likely) also uses regexes to identify tokens. However, a parser is more than a single regex. I really suggest to read the `Flex/Bison` O'Reilly. I'm pretty sure you'll have fun. (I don't say that an SGML parser is built using Bison, however the book is a nice read and explains the concepts very well) — hek2mgl, Feb 22 '16 at 16:50
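
The pattern from the first comment can be applied with preg_match_all(). This is a minimal sketch, assuming the pattern is meant to contain literal <h2> and </h2> tags around its first and last lazy groups. The function name matchTitlesAndText is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: the name matchTitlesAndText is illustrative, not from the thread.
function matchTitlesAndText(string $html): array
{
    // Assumed reading of the pattern: group 1 captures a heading,
    // group 2 captures the (whitespace-trimmed) text between headings.
    $pattern = '~(?s)(?:<h2>(.*?)</h2>|\s*(.+?)\s*(?=<h2>.*?</h2>|$))~';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    $titleNum = 1;
    $textNum = 1;
    foreach ($matches as $m) {
        // Group 2 needs at least one character, so a non-empty value
        // means the text alternative matched rather than the heading.
        if (($m[2] ?? '') !== '') {
            $result['text' . $textNum++] = $m[2];
        } else {
            $result['title' . $titleNum++] = $m[1];
        }
    }
    return $result;
}

$html = "some content\n<h2>title 1</h2>\n<p>more content</p>\n<h2>title 2</h2>\nrest of the content";
print_r(matchTitlesAndText($html));
```

Unlike the attempt in the question, the lookahead alternative `|$` lets the last text chunk match even though no heading follows it.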

score 0 · Answer 1 · answered Feb 22 '16 at 15:49

You should be able to do a regex split:

preg_split('/<\/?h2>/', $sampletext)

where $sampletext holds your input example. Because the pattern matches both <h2> and </h2>, the pieces alternate: even indexes (0, 2, ...) are text and odd indexes (1, 3, ...) are titles, so you can label them according to their array index.
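
Labelling by array index could look roughly like this. It is a sketch that assumes every <h2> has a matching </h2>. The function name splitByH2 is hypothetical, and the key names follow the first desired output in the question.

```php
<?php
// Hypothetical helper: the name splitByH2 is illustrative, not part of the answer.
function splitByH2(string $html): array
{
    // Splitting on both the opening and the closing tag yields
    // alternating pieces: text, title, text, title, ..., text.
    $parts = preg_split('/<\/?h2>/', $html);

    $result = [];
    foreach ($parts as $i => $part) {
        // Pieces 0 and 1 belong to pair 1, pieces 2 and 3 to pair 2, and so on.
        $n = intdiv($i, 2) + 1;
        // Even indexes hold text, odd indexes hold titles.
        $key = ($i % 2 === 0) ? "text$n" : "title$n";
        $result[$key] = trim($part);
    }
    return $result;
}

$sampletext = "some content\n<h2>title 1</h2>\n<p>more content</p>\n<h2>title 2</h2>\nrest of the content";
print_r(splitByH2($sampletext));
```

Empty pieces are kept on purpose (no PREG_SPLIT_NO_EMPTY), so the even/odd pattern still holds when the string starts with a heading.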

score 0 · Answer 2 · answered Feb 22 '16 at 15:52

I made you a function real quick. It has only been tested on your content, but maybe it will be helpful for you.

<?php
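function splitTitlesAndContent($needle1,$needle2,$content){
    // Split at every opening tag; each piece after the first starts with a title.
    $spli = explode($needle1,$content);
    $arr = array();
    $titlenum = 1;
    $contentnum = 1;

    foreach($spli as $spl){
        // Split the piece at the closing tag: [title, content that follows it].
        $expl = explode($needle2,$spl);

        if(isset($expl[1])){
            $arr['title' . $titlenum] = trim($expl[0]);
            $titlenum++;

            $arr['content' . $contentnum] = trim($expl[1]);
            $contentnum++;
        }
        else{
            // No closing tag in this piece, so it is plain content
            // (typically the text before the first title).
            $arr['content' . $contentnum] = trim($expl[0]);
            $contentnum++;
        }
    }
    return $arr;
}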

$content = 'some content
<h2>title 1</h2>
more content
<h2>title 2</h2>
rest of the content';

$splitted = splitTitlesAndContent('<h2>','</h2>',$content);
print_r($splitted);
?>

You can try it out here: http://sandbox.onlinephpfunctions.com/code/e80b68d919c0292e7b52d2069128e21ba1614f4c
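
The same explode() idea can also produce the second format from the question, a list of text/title pairs. This is a minimal sketch rather than part of the answer above. The function name splitIntoSections is hypothetical, and it assumes each opening tag is followed by a closing tag.

```php
<?php
// Hypothetical helper: the name splitIntoSections is illustrative only.
function splitIntoSections(string $open, string $close, string $content): array
{
    // The first chunk is the text before the first heading; every
    // later chunk has the shape "title</h2>text after the heading".
    $chunks = explode($open, $content);
    $text = trim(array_shift($chunks));

    $sections = [];
    foreach ($chunks as $chunk) {
        // Limit 2 keeps any later closing tags inside the text part.
        $pair = explode($close, $chunk, 2);
        // Pair the text collected so far with the title that follows it.
        $sections[] = ['text' => $text, 'title' => trim($pair[0])];
        $text = trim($pair[1] ?? '');
    }
    // Whatever follows the last heading has no title of its own.
    $sections[] = ['text' => $text];
    return $sections;
}

$content = "some content\n<h2>title 1</h2>\n<p>more content</p>\n<h2>title 2</h2>\nrest of the content";
print_r(splitIntoSections('<h2>', '</h2>', $content));
```

Each entry holds the text that precedes a title, and the last entry holds only the trailing text, as in the second desired output.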
