You could divide the problem in two here. One part is parsing the HTML, which is most easily done with DOMDocument and DOMXPath. The rest is running some mapping in the context of the result of another XPath expression / query. That may sound a bit complicated, but it is not. A more simplified variant is outlined in a previous answer to "Get parent element through xpath and all child elements".
In your case this is a bit more complicated. Here is some pseudo-code; I added the label because it makes things more visible for demonstration purposes:
foreach //li ::
    ID       := string(./@id)
    ParentID := string(./ancestor::li[1]/@id)
    Label    := normalize-space(./text()[1])
As this shows, it returns the bare data only. You also need the Order and the Children values. Normally the Children listing is not needed (I keep it here anyway). What the Order value and the Children value have in common is that they are retrieved from context.
E.g. while traversing the //li node list in document order, the order of each child can be numbered if a counter is kept per ParentID.
Similarly, the Children value needs to be built up while iterating over the list, like the counter. Only at the very end is the correct value for each list item available.
So those two values live in a context. I create that context in the form of an array keyed by ParentID: $parents. For each ID it will contain two entries: 0 containing the counter for Order and 1 containing an array that keeps the IDs of the Children (if any).
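To make the shape of that context more tangible, here is a minimal sketch that works on a plain list of ID / ParentID pairs instead of DOM nodes. The $items array and the $orders array are made up for this illustration only (and the ?? operator assumes PHP 7 or later); the layout of $parents is the one described above.

```php
// Hypothetical flat input: [id, parentId] pairs in document order.
$items = [[1, 0], [2, 0], [3, 2], [4, 2], [5, 2], [6, 5]];

$parents = [];
$orders  = [];
foreach ($items as list($id, $parentId)) {
    // Entry 0: running counter of the children seen so far for this parent.
    $parents[$parentId][0] = ($parents[$parentId][0] ?? 0) + 1;
    $orders[$id] = $parents[$parentId][0];

    // Entry 1: IDs of the children collected so far for this parent.
    $parents[$parentId][1][] = $id;
}

print_r($orders);     // order of every item among its siblings
print_r($parents[2]); // counter and child IDs kept for parent 2
```

The Order of an item is known as soon as the item is visited, while the Children list of a parent is only complete once the whole list has been traversed.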
Note: Technically this is not entirely accurate. The Order and Children values should be expressible in pure XPath as well; I just didn't do that in this example, to show how to add your own non-XPath context, e.g. if you want a different ordering or children handling.
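For comparison, a pure XPath variant could look roughly like the following sketch. It rests on assumptions that are not part of the answer itself: the sample markup in $html is made up, all siblings are taken to sit in the same list, and children are taken to be li elements of a ul nested directly inside the parent li.

```php
// Hypothetical sample markup, for illustration only.
$html = '<ul>
  <li id="1">Page 1</li>
  <li id="2">Page 2
    <ul>
      <li id="3">Sub Page A</li>
      <li id="4">Sub Page B</li>
    </ul>
  </li>
</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$rows = [];
foreach ($xp->query('//li') as $li) {
    $id = (int)$xp->evaluate('string(./@id)', $li);

    // Order: position among the sibling <li> elements of the same list.
    $order = (int)$xp->evaluate('count(./preceding-sibling::li) + 1', $li);

    // Children: IDs of the <li> elements in a directly nested <ul>.
    $children = [];
    foreach ($xp->query('./ul/li/@id', $li) as $attr) {
        $children[] = $attr->value;
    }

    $rows[$id] = [$order, implode(',', $children)];
}

print_r($rows);
```

This ties the result to the structure of the markup, which is why the counter-based context can be the more flexible choice if you need a different ordering.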
Enough with the theory. Considering the standard setup:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
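If you want to try the expressions from the pseudo-code on their own first, a small sketch like this one can help. The markup in $html is a made-up sample, not the input from the question, and the variable names are only for illustration.

```php
// Hypothetical sample input, for illustration only.
$html = '<ul>
  <li id="1">Page 1</li>
  <li id="2">Page 2
    <ul>
      <li id="3">Sub Page A</li>
    </ul>
  </li>
</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// Pick two list items to look at: a top-level one and a nested one.
$top    = $xp->query('//li[@id="2"]')->item(0);
$nested = $xp->query('//li[@id="3"]')->item(0);

// Only the first text node is used, so labels of nested items do not leak in.
$topLabel = $xp->evaluate('normalize-space(./text()[1])', $top);

// The nearest <li> ancestor is the parent; without one the string is empty.
$nestedParent = $xp->evaluate('string(./ancestor::li[1]/@id)', $nested);
$topParent    = $xp->evaluate('string(./ancestor::li[1]/@id)', $top);

var_dump($topLabel, $nestedParent, $topParent, (int)$topParent);
```

The empty string cast to an integer is 0, which is why top-level items end up with a ParentID of 0.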
The said mapping, including its context, can be written as an anonymous function:
$parents = [];
$map = function (DOMElement $li) use ($xp, &$parents) {
    // Bare data, as in the pseudo-code
    $id       = (int)$xp->evaluate('string(./@id)', $li);
    $parentId = (int)$xp->evaluate('string(./ancestor::li[1]/@id)', $li);
    $label    = $xp->evaluate('normalize-space(./text()[1])', $li);

    // Context: per-parent counter (entry 0) and list of child IDs (entry 1)
    isset($parents[$parentId][0]) ? $parents[$parentId][0]++ : ($parents[$parentId][0] = 1);
    $order = $parents[$parentId][0];
    $parents[$parentId][1][] = $id;
    isset($parents[$id][1]) || $parents[$id][1] = [];

    // The children array is returned by reference because it is still being filled
    return array($id, $label, $order, $parentId, &$parents[$id][1]);
};
As you can see, the first part retrieves the values as in the pseudo-code, and the second part handles the context values. That is merely initializing the context for the ID / ParentID if it does not exist yet.
This mapping needs to be applied:
$result = [];
foreach ($xp->query('//li') as $li) {
    list($id) = $array = $map($li);
    $result[$id] = $array;
}
This makes $result contain the listing of items and $parents the context data. Because a reference is used, the Children value needs to be imploded now; then the references can be removed:
foreach ($parents as &$parent) {
    $parent[1] = implode(',', $parent[1]);
}
unset($parent, $parents);
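Why this works may be easier to see in isolation. The following sketch uses made-up values and a single hypothetical $row instead of the full $result, but it relies on the same mechanism: an array element that was stored as a reference to an entry of $parents.

```php
$parents = [5 => [1 => [6, 7]]];

// Store a reference to the children array, as the mapping function does.
$row = [5, 'Sub Page C', &$parents[5][1]];

// Replacing the array via $parents is visible through the reference in $row.
$parents[5][1] = implode(',', $parents[5][1]);

// Dropping $parents removes only the alias; $row keeps the value.
unset($parents);

var_dump($row[2]);
```

So the rows can be created while the children lists are still incomplete, and they pick up the final, imploded value afterwards.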
This then makes $result the final result, which can be output:
echo '+----+----------------+-------+--------+----------+
| ID | LABEL          | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
';
foreach ($result as $line) {
    vprintf("| %' 2d | %' -14s |  %' 2d   |   %' 2d   | %-8s |\n", $line);
}
echo '+----+----------------+-------+--------+----------+
';
Which then looks like:
+----+----------------+-------+--------+----------+
| ID | LABEL          | ORDER | PARENT | CHILDREN |
+----+----------------+-------+--------+----------+
|  1 | Page 1         |   1   |    0   |          |
|  2 | Page 2         |   2   |    0   | 3,4,5    |
|  3 | Sub Page A     |   1   |    2   |          |
|  4 | Sub Page B     |   2   |    2   |          |
|  5 | Sub Page C     |   3   |    2   | 6        |
|  6 | Sub Sub Page I |   1   |    5   |          |
|  7 | Page 3         |   3   |    0   | 8        |
|  8 | Sub Page D     |   1   |    7   |          |
|  9 | Page 4         |   4   |    0   |          |
+----+----------------+-------+--------+----------+
You can find the Demo online here.