I need to fetch lessons from an online timetable (for a school) into an array, so that I can insert the data into my database. The online timetable (URL: roosters-hd.stenden.com) looks like this:

On the left-hand side are the times, and along the top are the school days (Mo, Tu, We, Th, Fr). Very basic.

Each lesson contains 6 values that I need to fetch.

Besides that, I also need to fetch the [StartDate] and [EndDate]. The time is determined by which row the lesson cell is in and by its rowspan. The date can be calculated by adding the column number to the start date (printed at the top). So in the end the array would look something like this:

[0] => Array
        (
            [0] => Array
                (
                    [Name] => Financiering
                    [Type] => WC
                    [Code] => DECBE3
                    [Classroom] => E2.053 - leslokaal
                    [Teacher] => Verboeket, Erik (E)
                    [Class] => BE1F, BE1B, BE1A
                    [StartDate] => 04/06/2013 08:30:00
                    [EndDate] => 04/06/2013 10:00:00
                )
                etc.

Because of my lack of experience in fetching data, I will probably end up with a highly inefficient and inflexible solution. Should I use an XML parser? Or regex? Any ideas on how to tackle this problem?

JasperJ
  • please **not** regex! http://stackoverflow.com/a/1732454/2170192 – Alex Shesterov Jul 11 '13 at 19:49
  • Yes, not regex. Regex is very powerful for matching strings, but it should not be used for this kind of parsing. Also, the link you posted returns 400 Bad Request. It would be good to see a live example; you can put it on jsfiddle.net – Vladimir Bozic Jul 11 '13 at 19:53
  • Fixed the link. I don't have an example right now, since I'm not sure where I should start. By that I mean the correct, efficient way of fetching the data. – JasperJ Jul 11 '13 at 19:56

1 Answer

The regex way:

<pre><?php
$html = file_get_contents('the_url.html');

$clean_pattern = <<<'LOD'
~
  # definitions
    (?(DEFINE)
        (?<start>         <!--\hSTART\hOBJECT-CELL\h-->                    ) 
        (?<end>           (?>[^<]++|<(?!!--))*<!--\hEND\hOBJECT-CELL\h-->  )

        (?<next_cell>     (?>[^<]++|<(?!td\b))*<td[^>]*+>  ) 
        (?<cell_content>  [^<]*+                           )
    )

  # pattern
    \g<start>
        \g<next_cell>     (?<Name>      \g<cell_content>   )  
        \g<next_cell>     (?<Type>      \g<cell_content>   )
        \g<next_cell>     (?<Code>      \g<cell_content>   )

        \g<next_cell>     (?<Classroom> \g<cell_content>   )
        \g<next_cell>

        \g<next_cell>     (?<Teacher>   \g<cell_content>   )
        \g<next_cell>     
        \g<next_cell>     (?<Class>     \g<cell_content>   )
    \g<end>
~x
LOD;

preg_match_all($clean_pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo <<<LOD
    Name: {$match['Name']}
    Type: {$match['Type']}
    Code: {$match['Code']}
    Classroom: {$match['Classroom']}
    Teacher: {$match['Teacher']}
    Class: {$match['Class']}<br/><br/>
LOD;
}

The DOM/XPath way:

<?php
$doc = new DOMDocument();
// @ suppresses the warnings libxml raises on malformed real-world HTML.
@$doc->loadHTMLFile('the_url.html');
$xpath = new DOMXPath($doc);
// Select every element that has a " START OBJECT-CELL " comment as a child.
$elements = $xpath->query("//*[comment() = ' START OBJECT-CELL ']");
$fields = array('Name', 'Type', 'Code', 'Classroom', 'Teacher', 'Class');
// Indexes of the lines in nodeValue that carry no field value.
$not_needed = array(10, 8, 6, 1, 0);
$result = array();
foreach ($elements as $element) {
    $temp = explode("\n", $element->nodeValue);
    foreach ($not_needed as $val) { unset($temp[$val]); }
    $temp = array_map('trim', $temp);
    $result[] = array_combine($fields, $temp);
}
print_r($result);
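
The question also asks for each cell's row, column, and rowspan. Below is a minimal sketch of how DOM could expose them, assuming a made-up table: `$html`, the `grid` id, and the layout of `$cells` are hypothetical, since the real page's markup is not shown here, and nested tables would need a tighter query.

```php
<?php
// Hypothetical markup standing in for the real timetable grid.
$html = '<table id="grid">
  <tr><td>8:30</td><td rowspan="2">A</td><td>B</td></tr>
  <tr><td>9:00</td><td>C</td></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$occupied = array(); // $occupied[row][column] = true once a cell covers it
$cells = array();
$r = 0;
foreach ($xpath->query('//table[@id="grid"]//tr') as $tr) {
    $c = 0;
    foreach ($xpath->query('td', $tr) as $td) {
        // Skip positions already covered by a rowspan from an earlier row.
        while (isset($occupied[$r][$c])) {
            $c++;
        }
        // A missing attribute yields '', which casts to 0, so default to 1.
        $rowspan = max(1, (int) $td->getAttribute('rowspan'));
        $colspan = max(1, (int) $td->getAttribute('colspan'));
        for ($i = 0; $i < $rowspan; $i++) {
            for ($j = 0; $j < $colspan; $j++) {
                $occupied[$r + $i][$c + $j] = true;
            }
        }
        $cells[] = array(
            'text'    => trim($td->textContent),
            'row'     => $r,
            'column'  => $c,
            'rowspan' => $rowspan,
        );
        $c += $colspan;
    }
    $r++;
}
print_r($cells);
```

In this sample, cell "C" is reported in column 2 rather than column 1, because the position below "A" is already marked as occupied by its rowspan.
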
Casimir et Hippolyte
  • I tried your raw pattern in Rubular, but it doesn't seem to match anything. http://rubular.com/r/xwfwYKy13S . – JasperJ Jul 11 '13 at 20:37
  • @JasperJ: Rubular is for Ruby, not for PHP; the best test you can do is IN YOUR CODE! Otherwise, you can use http://regex.larsolavtorvik.com/ which is designed for PHP. – Casimir et Hippolyte Jul 11 '13 at 20:43
  • Right, stupid me. I tried preg_match_all($raw_pattern, $data, $out); with $data being the result of file_get_contents on the URL. But still no success (PHP 5.3.26). I will wait for updates. – JasperJ Jul 11 '13 at 21:15
  • I like your XPath version. Still, I need to be able to calculate the start and end date. Is there any way to fetch the column name/number, the row name/number, and the rowspan of the cell? – JasperJ Jul 11 '13 at 23:11
  • @JasperJ: I have just seen that you need the start and end date-times. – Casimir et Hippolyte Jul 11 '13 at 23:16
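
For the start and end date-times raised in the comments, the arithmetic from the question (row and rowspan give the time, column gives the day) could look roughly like this. The function name `lessonTimes`, the 08:30 first slot, and the 30-minute row length are assumptions for illustration, not values taken from the timetable page:

```php
<?php
// Hypothetical helper: converts a cell's grid position into start/end times.
// Assumptions (not read from the real page): the first time row starts at
// 08:30, every row covers 30 minutes, and column 0 is the week's start date.
function lessonTimes($weekStart, $column, $row, $rowspan,
                     $firstSlot = '08:30', $slotMinutes = 30)
{
    // "!" resets all unparsed fields, so seconds are always 00.
    $start = DateTime::createFromFormat('!d/m/Y H:i', $weekStart . ' ' . $firstSlot);
    if ($start === false) {
        throw new InvalidArgumentException('Expected a date like 03/06/2013');
    }
    // The column selects the day, the row selects the time of day.
    $start->modify('+' . (int) $column . ' days');
    $start->modify('+' . ((int) $row * $slotMinutes) . ' minutes');

    // The rowspan tells how many rows (time slots) the lesson covers.
    $end = clone $start;
    $end->modify('+' . ((int) $rowspan * $slotMinutes) . ' minutes');

    return array(
        'StartDate' => $start->format('d/m/Y H:i:s'),
        'EndDate'   => $end->format('d/m/Y H:i:s'),
    );
}

// Week starting Monday 03/06/2013, Tuesday column, first row, three rows tall.
$times = lessonTimes('03/06/2013', 1, 0, 3);
print_r($times);
```

With those assumed values, the call reproduces the StartDate and EndDate of the example array in the question (04/06/2013 08:30:00 to 04/06/2013 10:00:00). The slot length and first slot would have to be adjusted to match the real grid.
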