Read out specific words out of complex xml

Question

I'm really new to PHP so please understand my ignorance.

I'm trying to build a web app that reads certain dishes out of an XML feed generated by our university's canteen homepage. Their menu is overloaded and badly designed, so I'm building a mobile-optimized web app as a project in my web design class. The web app will read out only the name of each dish and its price and leave the rest behind. I'm familiar with HTML/CSS/JavaScript and have started reading a bit about PHP, but unfortunately I can't figure out how to get only the important information out of their RSS feed.

Their RSS is here: RSS Feed of the canteen

The code I have until now:

<?php 
$xmlfile='http://www.studentenwerk-berlin.de/speiseplan/rss/htw_wilhelminenhof/tag/lang/0000000000000000000000000';
// Pass the URL as-is: rawurlencode() would also encode ":" and "/" and break it.
$xml = simplexml_load_file($xmlfile);

$result = $xml->channel->item->description;
?>

(I know this isn't much...) So I figured out how to load the XML, and I found the path under which the dishes live: they're in "description". The problem now is that these dishes are not neatly ordered in sub-elements but all sit in one line inside "description". (See the XML above.) How can I access, for example, all salads (Salate) and put them into an array so that I can format them into a new table later?

This is how the original table looks on their website: Canteen

(I know that you have to ask the owner before reading something off a website. This app is only an exercise at university.)

possible duplicate of [How to parse CDATA HTML-content of XML using SimpleXML?](http://stackoverflow.com/questions/15849209/how-to-parse-cdata-html-content-of-xml-using-simplexml) — hakre, Apr 24 '14 at 13:35
The HTML inside the RSS-XML `<description>` needs another object if you want to parse it again. The linked duplicate question has an answer that shows how you can do it. The reference for XML and HTML parsing with PHP is: [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/q/3577641/367456) — hakre, Apr 24 '14 at 13:36

score 0 · Accepted Answer · answered Apr 24 '14 at 14:39

Instead of an array, you can also approach this with an Iterator that encapsulates the logic for traversing the description's HTML for the meals. It's simple to use because it hides the complexity of the parsing.

Here is an example followed by the output:

$uri = 'http://www.studentenwerk-berlin.de/speiseplan/rss/htw_wilhelminenhof/tag/lang/0000000000000000000000000';
$rss = simplexml_load_file($uri);
$meals = new MealIterator($rss->channel->item->description, 'Salate');
foreach ($meals as $entry) {
    vprintf("%s - %s\n", $entry);
}

Output:

Große Salatschüssel mit gekochtem Ei - EUR 1.55 / 2.50 / 3.25
Kleine Salatschale - EUR 0.55 / 0.90 / 1.15
Doppelt-Große Salatschale - EUR 2.95 / 4.70 / 6.20
Große Salatschale - EUR 1.55 / 2.50 / 3.25

The iterator makes use of PHP's built-in DOM functionality, namely DOMDocument and DOMXPath. The first step is to obtain the table rows, each of which contains one meal. This is already done with XPath in the constructor:

public function __construct($html, $meal)
{
    $doc   = $this->createHtmlDoc($html);
    $xpath = new DOMXPath($doc);
    $expr  = sprintf('//th[.=%s]/../../following-sibling::tr', $this->xpathString($meal));
    $items = $xpath->query($expr);
    if ($items === FALSE) {
        throw new UnexpectedValueException('Failed to query the HTML document');
    }
    parent::__construct($items);
}

The key tool here is XPath. The query returns one <tr> per meal.
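To see what that expression selects, here is a minimal sketch run against a small hand-written table. The sample markup is an assumption (a `<thead>` holding the section heading, followed by `<tr>` siblings); the real feed's HTML may be laid out differently, in which case the expression needs adjusting.

```php
<?php
// Hypothetical stand-in for the feed's description HTML; not the real markup.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<table>'
      . '<thead><tr><th>Salate</th></tr></thead>'
      . '<tr><td>Kleine Salatschale</td><td>EUR 0.55 / 0.90 / 1.15</td></tr>'
      . '<tr><td>Große Salatschale</td><td>EUR 1.55 / 2.50 / 3.25</td></tr>'
      . '</table>'
      . '<table>'
      . '<thead><tr><th>Suppen</th></tr></thead>'
      . '<tr><td>Tomatensuppe</td><td>EUR 0.60 / 0.95 / 1.25</td></tr>'
      . '</table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings about the fragment quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Read the expression left to right:
//   //th[.='Salate']        the heading cell whose text is exactly "Salate"
//   /../..                  up to its row, then up to the row's container
//   /following-sibling::tr  every <tr> that follows that container
$rows = $xpath->query("//th[.='Salate']/../../following-sibling::tr");

$names = [];
foreach ($rows as $row) {
    $names[] = trim($row->getElementsByTagName('td')->item(0)->textContent);
}
print_r($names);
```

Because `following-sibling` only looks inside the same parent, the row from the second table is not picked up.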

The data of each meal still needs to be extracted. That happens in the iterator's current() method:

public function current()
{
    $entry = parent::current();
    $tds   = $entry->getElementsByTagName('td');
    $name  = $this->childTextContent($tds->item(0));
    $price = trim($tds->item(1)->textContent);
    return compact("name", "price");
}

This uses only DOMElement traversal methods (documented in the manual). Because the name cell was a bit harder to parse, another quickly written helper method fetches only the content of its direct child text nodes:

private function childTextContent(DOMNode $node)
{
    $buffer = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            $buffer .= $child->textContent;
        }
    }
    return trim($buffer);
}

(You can see the full code of the iterator.)
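For readers without access to the linked code, the pieces above can be assembled roughly as follows. This is a sketch, not the original class: `createHtmlDoc()` and `xpathString()` are not shown in this answer, so their bodies here are assumptions, as are the `IteratorIterator` base class (inferred from the `parent::` calls) and the sample HTML.

```php
<?php
class MealIterator extends IteratorIterator
{
    public function __construct($html, $meal)
    {
        $doc   = $this->createHtmlDoc($html);
        $xpath = new DOMXPath($doc);
        $expr  = sprintf('//th[.=%s]/../../following-sibling::tr', $this->xpathString($meal));
        $items = $xpath->query($expr);
        if ($items === false) {
            throw new UnexpectedValueException('Failed to query the HTML document');
        }
        parent::__construct($items);
    }

    #[\ReturnTypeWillChange]
    public function current()
    {
        $entry = parent::current();
        $tds   = $entry->getElementsByTagName('td');
        $name  = $this->childTextContent($tds->item(0));
        $price = trim($tds->item(1)->textContent);
        return compact('name', 'price');
    }

    // Assumed helper: parse an HTML fragment and hint UTF-8 to the parser.
    private function createHtmlDoc($html)
    {
        $doc  = new DOMDocument();
        $prev = libxml_use_internal_errors(true);
        $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
        libxml_clear_errors();
        libxml_use_internal_errors($prev);
        return $doc;
    }

    // Assumed helper: quote a PHP string as an XPath 1.0 string literal.
    private function xpathString($input)
    {
        if (strpos($input, "'") === false) {
            return "'" . $input . "'";
        }
        if (strpos($input, '"') === false) {
            return '"' . $input . '"';
        }
        // Both quote types occur: split at each single quote and concat() the parts.
        return "concat('" . strtr($input, array("'" => "', \"'\", '")) . "')";
    }

    // Concatenate only the direct child text nodes, skipping nested elements.
    private function childTextContent(DOMNode $node)
    {
        $buffer = '';
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMText) {
                $buffer .= $child->textContent;
            }
        }
        return trim($buffer);
    }
}

// Hypothetical sample markup; the nested <a> stands in for extra cell content.
$html = '<table><thead><tr><th>Salate</th></tr></thead>'
      . '<tr><td>Kleine Salatschale <a href="#">1</a></td><td>EUR 0.55 / 0.90 / 1.15</td></tr>'
      . '<tr><td>Große Salatschale <a href="#">2</a></td><td>EUR 1.55 / 2.50 / 3.25</td></tr>'
      . '</table>'
      . '<table><thead><tr><th>Suppen</th></tr></thead>'
      . '<tr><td>Tomatensuppe</td><td>EUR 0.60 / 0.95 / 1.25</td></tr></table>';

foreach (new MealIterator($html, 'Salate') as $entry) {
    vprintf("%s - %s\n", $entry);
}
```

The `#[\ReturnTypeWillChange]` line only silences a deprecation notice on PHP 8.1 and later; older versions read it as a comment.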

Key points in this solution:

Encapsulate the parsing in an iterator - if the source changes, the parsing might change as well - but not the whole program.
Re-use existing libraries like SimpleXML and its sister library DOMDocument.
Solve the problem by dividing from big into small.

If you would rather have an array than an iterator, that is only a small step away. Convert the iterator into an array:

print_r(iterator_to_array($meals, false));

Array
(
    [0] => Array
        (
            [name] => Große Salatschüssel mit gekochtem Ei
            [price] => EUR 1.55 / 2.50 / 3.25
        )

    [1] => Array
        (
            [name] => Kleine Salatschale
            [price] => EUR 0.55 / 0.90 / 1.15
        )

    [2] => Array
        (
            [name] => Doppelt-Große Salatschale
            [price] => EUR 2.95 / 4.70 / 6.20
        )

    [3] => Array
        (
            [name] => Große Salatschale
            [price] => EUR 1.55 / 2.50 / 3.25
        )

)
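The second argument to `iterator_to_array()` is `preserve_keys`. Passing `false` re-indexes the result from 0, which matters as soon as several iterators are chained and their keys collide. A small sketch with made-up data (plain `ArrayIterator`s rather than the meal iterator) shows the difference:

```php
<?php
// Hypothetical data: two sections, each with its own keys starting at 0.
$salads = new ArrayIterator(array('Kleine Salatschale', 'Große Salatschale'));
$soups  = new ArrayIterator(array('Tomatensuppe'));

$all = new AppendIterator();
$all->append($salads);
$all->append($soups);

// preserve_keys = true: the soup's key 0 overwrites the first salad.
$keyed = iterator_to_array($all, true);

// preserve_keys = false: keys are ignored and every entry is kept in order.
$listed = iterator_to_array($all, false);

printf("with keys: %d entries, without keys: %d entries\n", count($keyed), count($listed));
```

With keys preserved the result has two entries; without them it has all three.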

The routine to create an xpath string is from: Mitigating XPath Injection Attacks in PHP

You can use xpath to cast the text content of a node to a string: `$text = $xpath->evaluate('string(td[0])', $entry)` — ThW, Apr 24 '14 at 14:57
@ThW: Yes (with td[1]; XPath positions are not zero-based), but that would be the same as textContent, which is not wanted. Instead, the first child text node whose space-normalized text is non-empty should be used. Applying normalize-space again then returns the proper value: `$text = $xpath->evaluate('normalize-space((td[1]/text()[normalize-space() != ""])[1])', $entry)` — hakre, Apr 24 '14 at 15:53
Wow, that's awesome. It works perfectly. The only thing I don't understand is: what's the name of the array, and how can I keep it from printing? My plan is to generate all 6 arrays and then use JavaScript to calculate the size of the final table and fill it (if that is the right way to do it). The RSS changes daily, and the number of meals changes too. Thank you so much hakre! — vinni, Apr 24 '14 at 15:56
`$nameItAsYouLoveIt = iterator_to_array($meals, false);` :) - However, it's perhaps worth having another iterator for all sections so that you can stack them together. But that's perhaps a little too much for the beginning. — hakre, Apr 24 '14 at 15:56
@hakre I added `$salateReihe = iterator_to_array($meals, false); print_r($salateReihe);` But it gives me only an empty array ("array()") — vinni, Apr 24 '14 at 16:20
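
The ideas from this comment thread (the `normalize-space` expression, one named array per section, handing the result to JavaScript) could be combined along these lines. This is only a sketch under assumptions: `mealsForSection()` is a made-up helper, the section names other than Salate are placeholders, and the sample HTML is hand-written rather than taken from the feed.

```php
<?php
// Hypothetical stand-in for the feed's description HTML.
$html = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<table><thead><tr><th>Salate</th></tr></thead>'
      . '<tr><td><img src="x.png" alt=""> Kleine Salatschale <a href="#">1</a></td>'
      . '<td> EUR 0.55 / 0.90 / 1.15 </td></tr>'
      . '</table>'
      . '<table><thead><tr><th>Suppen</th></tr></thead>'
      . '<tr><td>Tomatensuppe</td><td>EUR 0.60 / 0.95 / 1.25</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Made-up helper: returns a plain array of name/price pairs for one section.
// Assumes $section contains no single quote; otherwise it needs XPath quoting.
function mealsForSection(DOMXPath $xpath, $section)
{
    $rows  = $xpath->query("//th[.='$section']/../../following-sibling::tr");
    $meals = array();
    foreach ($rows as $row) {
        $meals[] = array(
            // First non-empty direct text node of the first cell, whitespace-normalized.
            'name'  => $xpath->evaluate('normalize-space((td[1]/text()[normalize-space() != ""])[1])', $row),
            'price' => $xpath->evaluate('normalize-space(td[2])', $row),
        );
    }
    return $meals;
}

// One array per section, collected under the section name.
$menu = array();
foreach (array('Salate', 'Suppen') as $section) {
    $menu[$section] = mealsForSection($xpath, $section);
}

// Nothing is printed until you decide to; json_encode() gives JavaScript something to read.
echo json_encode($menu, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT), "\n";
```

Since the arrays are ordinary PHP values, their size is available with `count()` on the server or `.length` in JavaScript after decoding.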
