
This is a follow-up to my question yesterday, "Recursive UL LI to PHP multi-dimensional array". I've almost managed to convert the HTML block to an array, but there is one problem I cannot fix: when processing the HTML block below, the output array does not match the structure of the input. I cannot see where I'm going wrong and need a fresh pair of eyes.

I've included the following items:

  • HTML Block
  • PHP Function and Processing
  • Output

HTML Block

Basically takes the form of:

-A
  -B
    -C
----
-D
  -E
    -F
----
-G
  -H
    -I

As follows:

<li>
    <ul>
        <li>A</li>
        <li>
            <ul>
                <li>B</li>
                <li>
                    <ul>
                        <li>C</li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
</li>
<li>
    <ul>
        <li>D</li>
        <li>
            <ul>
                <li>E</li>
                <li>
                    <ul>
                        <li>F</li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
</li>
<li>
    <ul>
        <li>G</li>
        <li>
            <ul>
                <li>H</li>
                <li>
                    <ul>
                        <li>I</li>
                    </ul>
                </li>
            </ul>
        </li>
    </ul>
</li>

PHP Function and Processing

function process_ul($output_data, $data, $key, $level_data, $level_key){

    if(substr($data[$key], 0, 3) == '<ul'){
        // going down a level in the tree
        $level_key++;

        // check to see if the level key exists within the level data, else create it and set to zero
        if(!is_numeric($level_data[$level_key])){
            $level_data[$level_key] = 0;
        }

        // increment the key to look at the next line
        $key++;

        if(substr($data[$key], 0, 4) !== '</ul'){
            while(substr($data[$key], 0, 4) !== '</ul'){
                // whilst we don't have an end of list, do some recursion and keep processing the array

                $returnables = process_ul($output_data, $data, $key, $level_data, $level_key);
                $output_data = $returnables['output'];
                $data = $returnables['data'];
                $key = $returnables['key'];
                $level_data = $returnables['level_data'];
                $level_key = $returnables['level_key'];
            }
        }
    }

    if(substr($data[$key], 0, 4) !== '</ul' && $data[$key] !== "<li>" && $data[$key] !== "</li>"){
        // we don't want to be saving lines with no data or the ends of a list

        // get the array key value so we know where to save it in our array (so we can't overwrite anything that may already exist)
        $this_key = &$output_data;
        for($build_key=0;$build_key<($level_key+1); $build_key++){
            $this_key =& $this_key[$level_data[$build_key]];
        }

        if(is_array($this_key)){
            // look at the next key, find the next open one
            $this_key[(array_pop(array_keys($this_key))+1)] = $data[$key];
        } else {
            // a new entry, so nothing to worry about
            $this_key = $data[$key];
        }
        $level_data[$level_key]++;
    } else if(substr($data[$key], 0, 4) == '</ul'){
        // going up a level in the tree
        $level_key--;
    }

    // increment the key to look at the next line when we loop in a moment
    $key++;

    // prepare the data to be returned
    $return_me = array();
    $return_me['output'] = $output_data;
    $return_me['data'] = $data;
    $return_me['key'] = $key;
    $return_me['level_data'] = $level_data;
    $return_me['level_key'] = $level_key;

    // return the data
    return $return_me;
}


// explode the data coming in by looking at the new lines
$input_array = explode("\n", $html_ul_tree_in); 

// get rid of any empty lines - we don't like those
foreach($input_array as $key => $value){
    if(trim($value) !== ""){
        $input_data[] = trim($value);
    }
}

// set the array and the starting level
$levels = array();
$levels[0] = 0;
$this_level = 0;

// loop around the data and process it
for($i=0; $i<count($input_data); $i){
    $returnables = process_ul($output_data, $input_data, $i, $levels, $this_level);
    $output_data = $returnables['output'];
    $input_data = $returnables['data'];
    $i = $returnables['key'];
    $levels = $returnables['level_data'];
    $this_level = $returnables['level_key'];
}

// let's see how we did
print_r($output_data);

Output

Note that D is in the wrong position: it should be at [0][2], not [0][1][2], and every position after D is off by one place (I'm sure you can tell by looking).

Basically takes the form of:

-A
  -B
    -C
  -D
----
  -E
    -F
  -G
----
  -H
    -I

As follows:

Array
(
    [0] => Array
        (
            [0] => <li>A</li>
            [1] => Array
                (
                    [0] => <li>B</li>
                    [1] => Array
                        (
                            [0] => <li>C</li>
                        )

                    [2] => <li>D</li>
                )

            [2] => Array
                (
                    [1] => <li>E</li>

                    [2] => Array
                        (
                            [1] => <li>F</li>
                        )

                    [3] => <li>G</li>
                )

            [3] => Array
                (
                    [2] => <li>H</li>
                    [3] => Array
                        (
                            [2] => <li>I</li>
                        )

                )

        )

)

Thanks for your time - any assistance in outputting the array correctly would be greatly appreciated!

MrJ
  • It would be *so* much easier if you’d use a [proper HTML parser](http://stackoverflow.com/q/292926/53114). – Gumbo Jan 27 '12 at 11:46
  • What do you mean? The HTML looks valid to me (I know there is no opening or closing <ul>, but if they are there it simply didn't start to process it). The reason why I've done it this way is so I can process the contents of the <li> items - where I will need to use regex, amongst other things... – MrJ Jan 27 '12 at 11:52
  • Not the most helpful comment, but I would echo Gumbo here. I did a similar thing where I parsed an HTML table into a csv file, and simplexml was brilliant for it, the only hitch was that I had to make sure the HTML was well formed first, but that was just a matter of stripping out any attributes and making sure all tags were lowercase. – dartacus Jan 27 '12 at 12:32
  • @dartacus `just a matter of stripping out any attributes and making sure all tags were lowercase` - really? *Just?* If you can parse through the string well enough to do that with guaranteed accuracy, couldn't you just have built your CSV table in the same process? – DaveRandom Jan 27 '12 at 12:52
  • @DaveRandom You'd think, wouldn't you? It's a pro-tem solution until my content authors get around to making csv or xls files for people to download directly. You can see an example here: http://www.hesa.ac.uk/component/option,com_studrec/task,show_file/Itemid,233/mnl,11051/href,a%5E_%5ENHSBURSARY.html/ (at the bottom, 'valid entries') – dartacus Jan 28 '12 at 13:28

2 Answers


If your lists are always well formed, you could use this to do what you want. It uses SimpleXML, so it will not be forgiving of mistakes and bad form in the input code. If you want to be forgiving, you will need to use DOM - the code will be a little more complex, but not ridiculously so.

function ul_to_array ($ul) {
  if (is_string($ul)) {
    if (!$ul = simplexml_load_string("<ul>$ul</ul>")) {
      trigger_error("Syntax error in UL/LI structure");
      return FALSE;
    }
    return ul_to_array($ul);
  } else if (is_object($ul)) {
    $output = array();
    foreach ($ul->li as $li) {
      $output[] = (isset($li->ul)) ? ul_to_array($li->ul) : (string) $li;
    }
    return $output;
  } else return FALSE;
}

It takes data in the exact form as provided in the question - with no outer enclosing <ul> tags. If you want to pass the outer <ul> tags as part of the input string, just change

if (!$ul = simplexml_load_string("<ul>$ul</ul>")) {

to

if (!$ul = simplexml_load_string($ul)) {
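The DOM route mentioned at the top of this answer could look roughly like the sketch below. It is not the code from the answer, only an illustration of the same recursion using DOMDocument, whose HTML parser tolerates sloppier markup than SimpleXML. The function names `ul_to_array_dom` and `dom_list_to_array` are hypothetical, and the sketch assumes the input has no outer <ul> and contains plain ASCII text (loadHTML makes its own guess at the encoding otherwise).

```php
<?php
// Hypothetical helper: walk one <ul> node and build a nested array from its <li> children.
function dom_list_to_array(DOMNode $ul)
{
    $output = array();
    foreach ($ul->childNodes as $li) {
        // Skip whitespace text nodes and anything that is not an <li>.
        if ($li->nodeType !== XML_ELEMENT_NODE || $li->nodeName !== 'li') {
            continue;
        }
        // Look for a <ul> directly inside this <li>.
        $nested = null;
        foreach ($li->childNodes as $child) {
            if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'ul') {
                $nested = $child;
                break;
            }
        }
        // An <li> holding a list becomes a sub-array, otherwise its text is stored.
        $output[] = $nested ? dom_list_to_array($nested) : trim($li->textContent);
    }
    return $output;
}

// Hypothetical entry point: takes the <li> items without the outer <ul>, as in the question.
function ul_to_array_dom($html)
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<ul>' . $html . '</ul>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The first <ul> in document order is the wrapper added above.
    $root = $doc->getElementsByTagName('ul')->item(0);
    return $root ? dom_list_to_array($root) : array();
}

print_r(ul_to_array_dom('<li><ul><li>A</li><li><ul><li>B</li></ul></li></ul></li><li>D</li>'));
```

As with the SimpleXML version, an <li> that contains a <ul> is treated purely as a container, so any text sitting next to the nested list in the same <li> is dropped.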


DaveRandom
  • This looks good Dave - is it possible to have HTML within the <li> tags, passing that into the output array, rather than it being converted to XML, for example, if I had Bla bla bla - it would be preserved as this? So instead of B in the output array, it would contain everything that is not a <ul>, if you see what I mean...? – MrJ Jan 27 '12 at 12:56
  • @MrJ That is likely to break it. You could try replacing the line inside the `foreach` with this `$output[] = (isset($li->ul)) ? ul_to_array($li->ul) : (($li->count()) ? $li->children()->asXML() : (string) $li);` but with real world HTML you would be better with a DOM-based solution. – DaveRandom Jan 27 '12 at 13:11
  • Thanks @DaveRandom - I'll bear that in mind if I run into any difficulties, would the DOM-based one be similar to the other answer? – MrJ Jan 27 '12 at 13:22
  • @MrJ That's certainly a very good starting point, I haven't analysed it properly but it looks on the face of it to be the right approach. – DaveRandom Jan 27 '12 at 13:24
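For the question raised in these comments (keeping markup inside an <li> rather than flattening it to text), one possible DOM-based sketch is shown below. It is an assumption about how such a variant might look, not code posted by either commenter. All three function names are hypothetical, and the <b> tag in the usage line is only a stand-in for whatever inline markup the items really contain.

```php
<?php
// Hypothetical helper: serialise everything inside an <li>, keeping inline tags.
function li_inner_html(DOMNode $li)
{
    $html = '';
    foreach ($li->childNodes as $child) {
        // DOMDocument::saveHTML() accepts a node argument from PHP 5.3.6 onwards.
        $html .= $li->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

// Hypothetical helper: same recursion as a plain-text version, but leaf items keep their markup.
function list_to_array_keep_markup(DOMNode $ul)
{
    $output = array();
    foreach ($ul->childNodes as $li) {
        if ($li->nodeType !== XML_ELEMENT_NODE || $li->nodeName !== 'li') {
            continue; // skip whitespace and non-<li> nodes
        }
        $nested = null;
        foreach ($li->childNodes as $child) {
            if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'ul') {
                $nested = $child;
                break;
            }
        }
        $output[] = $nested ? list_to_array_keep_markup($nested) : li_inner_html($li);
    }
    return $output;
}

// Hypothetical entry point: expects the <li> items without an outer <ul>.
function ul_to_array_keep_markup($html)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<ul>' . $html . '</ul>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('ul')->item(0);
    return $root ? list_to_array_keep_markup($root) : array();
}

print_r(ul_to_array_keep_markup('<li><b>Bla</b> bla bla</li><li><ul><li>C</li></ul></li>'));
```

The serialised strings come from libxml, so attribute quoting and entity escaping may differ slightly from the original input; if byte-for-byte preservation matters, that would need checking against real data.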