
It's my first post on the site, so bear with me.

OK, so I'm a complete beginner with PHP and I have a specific need for it in my project. I'm hoping some of you can help!

Basically, I want to scrape a webpage, access a certain HTML table, and parse its contents into a specific output format.

So, where to begin... Here's the PHP I have written so far:

<?php

$url = "http://www.goldenplec.com/festivals/oxegen-2/oxegen-2011";
$raw = file_get_contents($url);

$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));

$start = strpos($content,'<table style="background: #FFF; font-size: 13px;"');
$end = strpos($content,'</table>',$start) + 8;

$table = substr($content,$start,$end-$start);

echo $table;


/* Regex here to echo the desired result */


?>

That URL contains the table I need. My code simply echoes that exact table.

However, and here's my problem: I'm by no means a regex expert, and I need to display the data from the table in a certain format. I want to echo an XML file containing a number of SQL INSERT statements, as follows:

$xml_output .= "<statement>INSERT INTO timetable VALUES(1,'Black Eyed Peas','Main Stage','Friday', '23:15')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(2,'Swedish House Mafia','Vodafone Stage','Friday', '23:30')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(3,'Foo Fighters','Main Stage','Saturday', '23:25')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(4,'Deadmau5','Vodafone Stage','Saturday', '23:05')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(5,'Coldplay','Main Stage','Sunday', '22:25')</statement>";
$xml_output .= "<statement>INSERT INTO timetable VALUES(6,'Pendalum','Vodafone Stage','Sunday', '22:15')</statement>";

I hope I have provided enough info and I would greatly appreciate any help from you kind folk.

Thanks in advance.

elgoog
  • [Interesting answer.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Jared Farrish Nov 03 '11 at 22:57
  • You should show us some more detailed HTML output so we can give you a regex trick :) – ArtoAle Nov 03 '11 at 22:57
  • You would probably be better off trying a parser instead of regex: http://php.net/manual/en/book.dom.php – Jared Farrish Nov 03 '11 at 22:58
  • There are existing (easy to google) html table extraction tools (using regex, which *do* happen to be suitable). But if you want to scrape and extract stuff, then use [phpQuery or QueryPath](http://stackoverflow.com/questions/3650125/how-to-parse-html-with-php/3659729#3659729), which allow for `pqhtml($url)->find("table")->...` and separating the values. – mario Nov 03 '11 at 23:03
  • Unfortunately they don't use `th` for headers and subheaders, so you'll most likely have to manually edit this even with the scrape. – Levi Morrison Nov 03 '11 at 23:39
  • @LeviMorrison - By "they" you mean the http://www.goldenplec.com/ people, right? – Jared Farrish Nov 04 '11 at 01:30
  • Possible duplicate of [Scrape web page contents](http://stackoverflow.com/questions/584826/scrape-web-page-contents) – John Slegers Feb 25 '16 at 16:56

1 Answer

You're much better off using something like XPath for scraping. I select all <td> elements and note that the venue is always UPPERCASE, which we can use to our advantage. The cells also include a list of days and some blanks, so I skip over those. I detect the start of the acts section by checking for ":", which denotes a time. Since the event lasts 3 days and the data interleaves the acts for each day, I increment the day for each act and reset it when it reaches the last day of the event.

There may be some character encoding issues going on here, but I didn't feel like meddling with that too much. There are possibly more elegant solutions out there.
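On the encoding point, here is a minimal sketch of one common workaround, assuming the page is actually served as UTF-8 (I have not verified that for this site). `loadUtf8Html` is a hypothetical helper name, not something from the code below. Prepending an XML encoding declaration is a widely used hint that makes `DOMDocument::loadHTML` treat the input as UTF-8 instead of falling back to a single-byte default.

```php
<?php
// Hypothetical helper: parse an HTML string that is assumed to be UTF-8.
function loadUtf8Html($html)
{
    // Collect parser warnings about sloppy markup instead of printing them.
    libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The leading declaration only tells the parser which encoding to
    // assume; it is an assumption about the page, not a detected value.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    return $dom;
}
```

If the page turns out to use a different encoding, the declaration would need to name that encoding instead.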

Edit: I just noticed that not all acts are exactly interleaved across the 3 days, so getting the day of the event is more difficult. The code below will not give accurate days for every act, mainly "Little Green Cars" and "Touchwood".

Edit2: The code is now updated and should parse all acts with the correct day. The offending days that have nothing scheduled are represented by two empty strings (""). We can detect these and increment our $day counter.

<?php

libxml_use_internal_errors(true);

$url = "lineup2011.html";
$rawHTML = file_get_contents($url);

$dom = new DOMDocument();
$dom->loadHTML($rawHTML);


$xpath = new DOMXPath($dom);

$nodeList = $xpath->query("//table//td");

$nodeCount = 0;
$venue = "";
$day = 0;
$acts = array();

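while ($nodeCount < $nodeList->length) {
    $value = $nodeList->item($nodeCount)->nodeValue;

    // A venue heading is all uppercase and contains no ":" (which would mark a time).
    // Skip past the heading cells (venue, day names, blanks) and restart the day counter.
    if (isUpper($value) && strpos($value, ":") === false && $value != "") {
        $venue = $value;
        $nodeCount += 7;
        $day = 0;
        continue;
    }

    // Two consecutive empty cells mean nothing is scheduled in this slot,
    // so move on to the next day without recording an act.
    if ($value == "" && $nodeList->item($nodeCount + 1)->nodeValue == "") {
        $day++;
        $nodeCount += 2;
        continue;
    }

    // Cells come in pairs: the time first, then the act name.
    $act = array();
    $act['time'] = $value;
    $act['name'] = $nodeList->item($nodeCount + 1)->nodeValue;
    $act['venue'] = $venue;

    // Acts are interleaved by day, so wrap the counter to 0, 1 or 2.
    $act['day'] = $day % 3;

    $day++;

    $acts[] = $act;
    $nodeCount += 2;
}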

print_r($acts);


function isUpper($str) {
    return (strtoupper($str) == $str);
}
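
To get from the $acts array to the output format asked for in the question, something along these lines might work. This is only a sketch: `actsToXml` is a hypothetical helper, the `<statements>` wrapper element and the Friday/Saturday/Sunday mapping for days 0, 1 and 2 are assumptions, and doubling single quotes is the standard SQL way to escape a string literal but may not suit every database.

```php
<?php
// Hypothetical helper: build the <statement> XML from the scraped acts.
// $dayNames maps the day index (0, 1, 2) to a label; the default is an assumption.
function actsToXml(array $acts, array $dayNames = array('Friday', 'Saturday', 'Sunday'))
{
    $xml = "<statements>";
    $id = 1;

    foreach ($acts as $act) {
        $fields = array(
            trim($act['name']),
            trim($act['venue']),
            $dayNames[$act['day']],
            trim($act['time']),
        );

        // Double any single quote so a name containing an apostrophe
        // does not terminate the SQL string literal early.
        $quoted = array();
        foreach ($fields as $field) {
            $quoted[] = "'" . str_replace("'", "''", $field) . "'";
        }

        $sql = "INSERT INTO timetable VALUES(" . $id . "," . implode(",", $quoted) . ")";

        // Escape &, < and > so the statement is valid XML text content.
        $xml .= "<statement>"
              . htmlspecialchars($sql, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8')
              . "</statement>";
        $id++;
    }

    return $xml . "</statements>";
}
```

Calling `echo actsToXml($acts);` after the loop would print the statements. Note that the venue is emitted exactly as scraped (uppercase), so it would still need converting if you want 'Main Stage' rather than 'MAIN STAGE'.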
Klinky
  • Wow, thanks so much Klinky!! Just reading through the code, trying to get my head around it. Just one thing: some of the 3rd-day (Sunday) acts seem to have day set to 0, whereas other Sunday acts are correctly set to day = 3? – elgoog Nov 04 '11 at 23:50
  • But it's no biggie, I can modify my client code for the expected results. Once again, thanks guys for your assistance. Excellent site :) – elgoog Nov 05 '11 at 00:19
  • I had the wrong value for my modulo operation (%). It should be fixed now. Days are now 0, 1 & 2 rather than 1, 2 & 3, and they should be in the correct order. – Klinky Nov 05 '11 at 01:41