How can I extract structured text from an HTML list in PHP?

Question

I have this string:

<ul>
  <li id="1">Page 1</li>
  <li id="2">Page 2
    <ul>
      <li id="3">Sub Page A</li>
      <li id="4">Sub Page B</li>
      <li id="5">Sub Page C
        <ul>
          <li id="6">Sub Sub Page I</li>
        </ul>
      </li>
    </ul>
  </li>
  <li id="7">Page 3
    <ul>
      <li id="8">Sub Page D</li>
    </ul>
  </li>
  <li id="9">Page 4</li>
</ul>

and I want to extract all of the information from it with PHP and arrange it like this:

----------------------------------
| ID | ORDER | PARENT | CHILDREN |
----------------------------------
|  1 |   1   |   0    |    0     |
|  2 |   2   |   0    |  3,4,5   |
|  3 |   1   |   2    |    0     |
|  4 |   2   |   2    |    0     |
|  5 |   3   |   2    |    6     |
|  6 |   1   |   5    |    0     |
|  7 |   3   |   0    |    8     |
|  8 |   1   |   7    |    0     |
|  9 |   4   |   0    |    0     |
----------------------------------

To explain further, this is what the list means to me:

ID 1 is 1st (Page 1) and has 0 parents and 0 children,

ID 2 is 2nd (Page 2) and has 0 parents and children IDs 3,4,5,

ID 3 is 1st (Sub Page A) and has parent ID 2 and 0 children,

ID 4 is 2nd (Sub Page B) and has parent ID 2 and 0 children,

ID 5 is 3rd (Sub Page C) and has parent ID 2 and children ID 6,

ID 6 is 1st (Sub Sub Page I) and has parent ID 5 and 0 children,

ID 7 is 3rd (Page 3) and has 0 parents and children ID 8,

ID 8 is 1st (Sub Page D) and has parent ID 7 and 0 children,

ID 9 is 4th (Page 4) and has 0 parents and 0 children.

If this is too tough, can anyone suggest another method to get that information from this string?

[Use DOMDocument](http://php.net/manual/en/class.domdocument.php). — moonwave99, Dec 29 '12 at 17:49

score 2 · Answer 1 · answered Dec 29 '12 at 17:54

That's not "a string", it's HTML. You need to use an HTML parser like DOMDocument or simple_html_dom.

See examples at http://htmlparsing.com/php.html
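
As a minimal sketch of what using a parser looks like here (the shortened `$html` sample and the variable names are assumptions for illustration, not part of the original answer), DOMDocument can load the string and hand back every `li` element in document order:

```php
<?php
// Hypothetical sample input, shortened from the list in the question.
$html = '<ul><li id="1">Page 1</li><li id="2">Page 2<ul><li id="3">Sub Page A</li></ul></li></ul>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// getElementsByTagName() returns matching elements in document order.
$ids = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $ids[] = $li->getAttribute('id');
}

echo implode(',', $ids), "\n"; // prints 1,2,3
```

The nesting is preserved in the DOM tree, so the parent and children of each item can be read from the node itself rather than reconstructed from the text.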

answered Dec 29 '12 at 17:54

Andy Lester

No, I really mean a PHP string: $my_str = ' '; – Dee001 Dec 29 '12 at 17:55

It is a string, but it is a string containing data in a standardised mark-up (HTML), therefore it makes sense to use existing parsers on it rather than write your own. – Philip Whitehouse Dec 29 '12 at 17:56

Yes, the HTML happens to be in a PHP string, but the string contains HTML, and you need to parse that HTML with an HTML parser. If anyone tells you to use regular expressions, ignore them. Use an HTML parser. – Andy Lester Dec 29 '12 at 17:57

score 1 · Answer 2 · edited May 23 '17 at 12:21

You can split the problem into two parts. The first part is parsing the HTML, which is most easily done with DOMDocument and DOMXPath. The second part is running a mapping on each node returned by an XPath query, that is, evaluating further XPath expressions in the context of each result node. That may sound a bit complicated, but it is not. A simplified variant is outlined in a previous answer to "Get parent element through xpath and all child elements".

Your case is a bit more complicated. Here is some pseudo-code; I added the label because it makes the demonstration easier to follow:

foreach //li ::
    ID       := string(./@id)
    ParentID := string(./ancestor::li[1]/@id)
    Label    := normalize-space(./text()[1])

As this shows, the mapping returns only the bare data. You also need the Order and the Children. Normally the Children listing is not needed (I keep it here anyway). What the Order and Children values have in common is that they are derived from context.

E.g. while traversing the //li node list in document order, each child can be numbered if a counter is kept per ParentID.

The Children value is similar: like the counter, it needs to be built up while iterating over the list. Only at the very end is the correct value available for each list item.

So those two values depend on context. I create that context in the form of an array keyed by ParentID: $parents. For each ID it contains two entries: 0 holds the counter for Order and 1 holds an array of the Children IDs (if any).

Note: Strictly speaking, this context is not required. The Order and Children can be expressed in pure XPath as well; I just didn't do that in this example, to show how to add your own non-XPath context, e.g. if you want a different ordering or children handling.

Enough theory. Assuming the standard setup:
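
The bookkeeping can be tried out in isolation, without any DOM code. The following is only a sketch: the hard-coded `$pairs` list and the separate `$orders` array are assumptions made for the illustration (the pairs are copied by hand from the list in the question), and the destructuring syntax needs PHP 7.1 or later.

```php
<?php
// Hypothetical input: (id, parentId) pairs in document order, 0 meaning "no parent".
$pairs = [[1, 0], [2, 0], [3, 2], [4, 2], [5, 2], [6, 5], [7, 0], [8, 7], [9, 0]];

$parents = []; // context keyed by parent ID
$orders  = []; // order of each item among its siblings

foreach ($pairs as [$id, $parentId]) {
    // Entry 0: running counter of the children seen so far for this parent.
    $parents[$parentId][0] = ($parents[$parentId][0] ?? 0) + 1;
    $orders[$id] = $parents[$parentId][0];

    // Entry 1: IDs of the children collected so far.
    $parents[$parentId][1][] = $id;
}

echo $orders[5], ' ', implode(',', $parents[2][1]), "\n"; // prints 3 3,4,5
```

The order of an item is known as soon as it is visited, but the children list of a parent is only complete once the whole list has been traversed.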

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

The mapping, including its context, can be written as an anonymous function:

$parents = [];

$map = function (DOMElement $li) use ($xp, &$parents) {

    $id       = (int)$xp->evaluate('string(./@id)', $li);
    $parentId = (int)$xp->evaluate('string(./ancestor::li[1]/@id)', $li);
    $label    = $xp->evaluate('normalize-space(./text()[1])', $li);

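    // Increment the per-parent counter (entry 0); its new value is this item's order.
    isset($parents[$parentId][0]) ? $parents[$parentId][0]++ : ($parents[$parentId][0] = 1);
    $order                   = $parents[$parentId][0];
    // Register this item as a child of its parent (entry 1).
    $parents[$parentId][1][] = $id;
    // Make sure this item has its own (possibly empty) children list to reference below.
    isset($parents[$id][1]) || $parents[$id][1] = [];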

    return array($id, $label, $order, $parentId, &$parents[$id][1]);
};

As you can see, the first part retrieves the values as in the pseudo-code, and the second part handles the context values. It mainly initializes the context for the ID / ParentID if it does not exist yet.

This mapping needs to be applied:

$result = [];
foreach ($xp->query('//li') as $li) {
    list($id) = $array = $map($li);
    $result[$id] = $array;
}

This makes $result contain the list of items and $parents the context data. Because each row holds a reference to its Children array, the Children values need to be imploded now; then the references can be removed:

foreach ($parents as &$parent) {
    $parent[1] = implode(',', $parent[1]);
}
unset($parent, $parents);
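
Why this works comes down to how PHP references behave inside arrays. Here is a small standalone sketch; the `$row` and `$children` names are made up for the illustration and do not appear in the code above:

```php
<?php
$children = [];

// The row stores a reference to $children, not a copy of it.
$row = ['id' => 2, 'children' => &$children];

// Changes made later are visible through the row.
$children[] = 3;
$children[] = 4;

// Replacing the array with a string also shows up in the row.
$children = implode(',', $children);

echo $row['children'], "\n"; // prints 3,4

// Dropping the original name leaves the value in the row intact.
unset($children);
echo $row['children'], "\n"; // prints 3,4
```

This is why every row can be returned before its children are known: the row only points at the list, and the final string appears in it once the list has been imploded.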

$result is now the final result, which can be output:

echo '+----+----------------+-------+--------+----------+
| ID |     LABEL      | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
';
foreach ($result as $line) {
    vprintf("| %' 2d | %' -14s |  %' 2d   |   %' 2d   | %-8s |\n", $line);
}
echo '+----+----------------+-------+--------+----------+
';

Which then looks like:

+----+----------------+-------+--------+----------+
| ID |     LABEL      | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
|  1 | Page 1         |   1   |    0   |          |
|  2 | Page 2         |   2   |    0   | 3,4,5    |
|  3 | Sub Page A     |   1   |    2   |          |
|  4 | Sub Page B     |   2   |    2   |          |
|  5 | Sub Page C     |   3   |    2   | 6        |
|  6 | Sub Sub Page I |   1   |    5   |          |
|  7 | Page 3         |   3   |    0   | 8        |
|  8 | Sub Page D     |   1   |    7   |          |
|  9 | Page 4         |   4   |    0   |          |
+----+----------------+-------+--------+----------+

You can find the Demo online here.

score 0 · Answer 3 · answered Dec 29 '12 at 21:52

I am leaving a second answer because it demonstrates how to do this with a single mapping (in pseudo-code):

foreach //li ::
    ID       := string(./@id)
    ParentID := string(./ancestor::li[1]/@id)
    Label    := normalize-space(./text()[1])
    Order    := count(./preceding-sibling::li)+1
    Children := implode(",", ./ul/li/@id)
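
To see what these expressions return, they can be evaluated against a single node. This is a sketch under assumptions: the shortened `$html` fragment and the choice of the `li` with id 5 as context node are only for illustration.

```php
<?php
// Hypothetical fragment: "Page 2" and its subtree from the question.
$html = '<ul><li id="2">Page 2<ul>'
      . '<li id="3">Sub Page A</li>'
      . '<li id="4">Sub Page B</li>'
      . '<li id="5">Sub Page C<ul><li id="6">Sub Sub Page I</li></ul></li>'
      . '</ul></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Use the <li id="5"> element as the context node.
$li = $xp->query('//li[@id="5"]')->item(0);

// Order: number of preceding <li> siblings plus one.
$order = (int) $xp->evaluate('count(./preceding-sibling::li)+1', $li);

// Parent: id of the nearest <li> ancestor (empty string, hence 0, at the top level).
$parentId = (int) $xp->evaluate('string(./ancestor::li[1]/@id)', $li);

// Children: id attributes of the <li> elements in the directly nested <ul>.
$children = [];
foreach ($xp->query('./ul/li/@id', $li) as $attr) {
    $children[] = $attr->nodeValue;
}

printf("order=%d parent=%d children=%s\n", $order, $parentId, implode(',', $children));
// prints order=3 parent=2 children=6
```

Each value depends only on the context node, so no state has to be carried from one list item to the next.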

Because this can be evaluated for each li node independently, regardless of the order, it is a good fit for an Iterator. Here is the current() function:

public function current() {

    return [
        'ID'       => $this->evaluate('number(./@id)'),
        'label'    => $this->evaluate('normalize-space(./text()[1])'),
        'order'    => $this->evaluate('count(./preceding-sibling::li)+1'),
        'parentID' => $this->evaluate('number(concat("0", ./ancestor::li[1]/@id))'),
        'children' => $this->implodeNodes(',', './ul/li/@id'),
    ];
}

Full example (Demo) output and code:

+----+----------------+-------+--------+----------+
| ID |     LABEL      | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
|  1 | Page 1         |   1   |    0   |          |
|  2 | Page 2         |   2   |    0   | 3,4,5    |
|  3 | Sub Page A     |   1   |    2   |          |
|  4 | Sub Page B     |   2   |    2   |          |
|  5 | Sub Page C     |   3   |    2   | 6        |
|  6 | Sub Sub Page I |   1   |    5   |          |
|  7 | Page 3         |   3   |    0   | 8        |
|  8 | Sub Page D     |   1   |    7   |          |
|  9 | Page 4         |   4   |    0   |          |
+----+----------------+-------+--------+----------+


class HtmlListIterator extends IteratorIterator
{
    private $xpath;

    public function __construct($html) {

        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $this->xpath = new DOMXPath($doc);
        parent::__construct($this->xpath->query('//li'));
    }

    private function evaluate($expression) {

        return $this->xpath->evaluate($expression, parent::current());
    }

    private function implodeNodes($glue, $expression) {

        // Join the values of all nodes matched by $expression, relative to the current <li>.
        return implode(
            $glue, array_map(function ($a) {

                return $a->nodeValue;
            }, iterator_to_array($this->evaluate($expression)))
        );
    }

    public function current() {

        return [
            'ID'       => $this->evaluate('number(./@id)'),
            'label'    => $this->evaluate('normalize-space(./text()[1])'),
            'order'    => $this->evaluate('count(./preceding-sibling::li)+1'),
            'parentID' => $this->evaluate('number(concat("0", ./ancestor::li[1]/@id))'),
            'children' => $this->implodeNodes(',', './ul/li/@id'),
        ];
    }
}

print_result(new HtmlListIterator($html));

function print_result($result) {

    echo '+----+----------------+-------+--------+----------+
| ID |     LABEL      | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
';
    foreach ($result as $line) {
        vprintf("| %' 2d | %' -14s |  %' 2d   |   %' 2d   | %-8s |\n", $line);
    }
    echo '+----+----------------+-------+--------+----------+
';
}
