Example string (html content):

some content
<h2>title 1</h2>
<p>more content</p>
<h2>title 2</h2>
rest of the content

I need to split this into an associative array on the <h2>...</h2> elements while keeping all of the string's contents.

Desired outputs:

array(){
  'text1' => 'some content',
  'title1' => 'title 1',
  'text2' => '<p>more content</p>',
  'title2' => 'title 2',
  'text3' => 'rest of the content'
}

or

array(){
  [0] => {
    'text' => 'some content',
    'title' => 'title 1'
  },
  [1] => {
    'text' => '<p>more content</p>',
    'title' => 'title 2'
  },
  [2] => {
    'text' => 'rest of the content'
  }
}

What I tried

preg_split() with PREG_SPLIT_DELIM_CAPTURE almost does the job, but it outputs an indexed array.
I also tried a regex, but it fails to capture text3:
(.*?)(<h2.*?<\/h2>)

Any help or ideas would be much appreciated.

Matija Mrkaic
  • Are those linebreaks actual new lines? – Glubus Feb 22 '16 at 15:26
  • Are "some content", "more content" ... only text, or HTML? – Casimir et Hippolyte Feb 22 '16 at 15:30
  • Yes, content is HTML. – Matija Mrkaic Feb 22 '16 at 15:43
  • If you would rather not use regex, try `str_word_count`. – Renjith V R Feb 22 '16 at 15:49
  • Use a DOM Parser to parse HTML. – hek2mgl Feb 22 '16 at 15:52
  • Use this `(?s)(?:<h2>(.*?)</h2>|\s*(.+?)\s*(?=<h2>.*?</h2>|$))` forget that duplicate junk. Parsing html with a DOM will fail if the html is junked up. Use something that works. Or, you could try to find a DOM parser that can go past malformed html (and you can't). –  Feb 22 '16 at 15:56
  • @sln Thank you very much! This helped a lot. If you want to post your comment as an answer, I'll be glad to accept it. On the other note, I don't get all the fuss about DOM parsers and stuff, all I need is to split the bloody string, it doesn't matter if it's html or not. – Matija Mrkaic Feb 22 '16 at 16:12
  • @MatijaMrkaic - Questions marked as duplicates cannot be answered. And in fact, your question is going into the boneyard never to be seen again. You can petition the admin if you'd like. –  Feb 22 '16 at 16:14
  • @sln Usually a DOM parser in HTML mode *can* parse incomplete and even malformed HTML. That's why not every second webpage fails to display with "Malformed markup". PHP's DOM parser, which is based on libxml2, does a good job there. A regex can't be used for this kind of task; at least, it will not work reliably and is hard to maintain. The unreliability and the maintenance burden grow with the complexity of the section to be parsed. – hek2mgl Feb 22 '16 at 16:36
  • @hek2mgl - If a DOM parser can go past malformed sgml, then what makes it different from a regular expression parser? Seriously, consider this simple looking thing `(?s)<[\w:]+(?:".*?"|'.*?'|[^>]*?)+>`. It is probably the most complex regex ever made that parses an html tag, malformed or not. This is engine power. –  Feb 22 '16 at 16:45
  • @sln Sure, a parser internally (likely) also uses regexes to identify tokens. However, a parser is more than a single regex. I really suggest reading the O'Reilly `Flex/Bison` book. I'm pretty sure you'll have fun. (I'm not saying that an SGML parser is built using Bison, but the book is a nice read and explains the concepts very well) – hek2mgl Feb 22 '16 at 16:50
  • I'll check it out, thanks. –  Feb 22 '16 at 16:52
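
For reference, here is a minimal sketch of how an alternation pattern like the one suggested in the comments might be applied with preg_match_all() to get the first desired output. The exact pattern, the function name splitOnH2Regex, and the key names are assumptions for illustration. It also assumes bare <h2> tags without attributes.

```php
<?php
// Hypothetical helper: returns ['text1' => ..., 'title1' => ..., 'text2' => ...].
function splitOnH2Regex(string $html): array
{
    // Assumed pattern: group 1 captures the inside of an <h2>...</h2> pair,
    // group 2 captures any other text up to the next <h2>...</h2> or the end.
    $pattern = '~(?s)(?:<h2>(.*?)</h2>|\s*(.+?)\s*(?=<h2>.*?</h2>|$))~';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    $textNum = 1;
    $titleNum = 1;
    foreach ($matches as $m) {
        if (isset($m[2])) {
            // Group 2 took part in the match, so this is a text chunk.
            $result['text' . $textNum++] = $m[2];
        } else {
            // Only group 1 is present, so this is a title.
            $result['title' . $titleNum++] = $m[1];
        }
    }
    return $result;
}

$html = "some content\n<h2>title 1</h2>\n<p>more content</p>\n<h2>title 2</h2>\nrest of the content";
print_r(splitOnH2Regex($html));
```

With PREG_SET_ORDER each match arrives as its own small array, which makes it easy to tell a title match from a text match by checking which group is set.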

2 Answers

You should be able to do a regex split:

preg_split('/<\/?h2>/', $sampletext)

where $sampletext holds a string like your input example. The pattern matches both the opening and the closing tag, so the resulting pieces alternate: even indexes hold text and odd indexes hold titles. You can therefore label them according to their array index.
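
A minimal sketch of that index-based labelling, producing the second desired output from the question. The function name groupByH2 is made up for illustration, and the sketch assumes every <h2> is closed and carries no attributes.

```php
<?php
// Hypothetical helper: pairs each text chunk with the title that follows it.
function groupByH2(string $html): array
{
    // Splitting on both <h2> and </h2> leaves text at even indexes
    // and titles at odd indexes.
    $parts = preg_split('/<\/?h2>/', $html);

    $groups = [];
    foreach (array_chunk($parts, 2) as $pair) {
        $group = ['text' => trim($pair[0])];
        if (isset($pair[1])) {
            // The last chunk has no title when text follows the final </h2>.
            $group['title'] = trim($pair[1]);
        }
        $groups[] = $group;
    }
    return $groups;
}

$html = "some content\n<h2>title 1</h2>\n<p>more content</p>\n<h2>title 2</h2>\nrest of the content";
print_r(groupByH2($html));
```

If the tags can carry attributes (for example a class), the split pattern would need to be loosened, and at that point a DOM parser is probably the safer route.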

bmbigbang

I wrote you a quick function. It has only been tested on your sample content, but it may be helpful.

<?php
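// Splits $content into numbered 'title'/'content' entries, where
// $needle1 is the opening delimiter and $needle2 the closing one.
function splitTitlesAndContent($needle1,$needle2,$content){
    // Every chunk after the first begins right after an opening delimiter.
    $spli = explode($needle1,$content);
    $arr = array();
    $titlenum = 1;
    $contentnum = 1;

    foreach($spli as $spl){
        $expl = explode($needle2,$spl);

        if(isset($expl[1])){
            // Closing delimiter found: a title followed by content.
            $arr['title' . $titlenum] = trim($expl[0]);
            $titlenum++;

            $arr['content' . $contentnum] = trim($expl[1]);
            $contentnum++;
        }
        else{
            // No closing delimiter: the whole chunk is content.
            $arr['content' . $contentnum] = trim($expl[0]);
            $contentnum++;
        }
    }
    return $arr;
}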

$content = 'some content
<h2>title 1</h2>
more content
<h2>title 2</h2>
rest of the content';

$splitted = splitTitlesAndContent('<h2>','</h2>',$content);
print_r($splitted);
?>

You can try it out here: http://sandbox.onlinephpfunctions.com/code/e80b68d919c0292e7b52d2069128e21ba1614f4c