web page scrape issue

Question

I have a big problem. I want to parse a web page using PHP, and I don't understand why it doesn't work. I want to take the "tr" tags from that page and then parse each piece of text obtained that way by its "td" tags. The problem is that I can't parse the text, because between two tags there can be another two.

Is there any trick I should know about? I've been trying this for over 2 days and I still can't get a result.

This is the page:

http://www.tjareborg.fi/akkilahdot?DepartureIds=-1&CtryId=-1&DestinationAirportIds=-1&ResId=-1&QueryDurID=a&QueryDepDate=10.6.2011&LmsTypeId=2%2c3%2c1&PaxPrice=2167&SortAscending=True&page=0

All I want to do is parse that table, and get the content of every cell.

Thank you so much!!!

*(related)* [Best methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Jun 10 '11 at 09:46
You might want to point out what you have already tried and show us some code. StackOverflow has many examples of how to parse HTML, and right now your question comes across like gimme-teh-codez. — Gordon, Jun 10 '11 at 09:54
*(related)* [Robust and Mature HTML parser for PHP](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) — Gordon, Jun 10 '11 at 09:57

Yoshi · Answer 1 · 2011-06-10T09:55:40.670

Try:

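// Collect libxml parse warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$url = '%your url%';
$html = file_get_contents($url);
if ($html === false) {
    die('Could not fetch the page.');
}

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the warnings collected while parsing.
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = array();
// Every row inside the element with id "tblLmsList".
foreach ($xpath->query('//*[@id="tblLmsList"]//tr') as $tr) {
    $cells = array();
    // 'td' is evaluated relative to the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->nodeValue);
    }

    // Skip rows that contain no td cells.
    if (count($cells) > 0) {
        $rows[] = $cells;
    }
}

print_r($rows);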

Output:

Array
(
    [0] => Array
        (
            [0] => la 11.6.
            [1] => Varna
                Bulgaria
            [2] => Helsinki
            [3] => Matkajokeri
            [4] => 175,-
            [5] => 
            [6] => -
            [7] => 
            [8] => -
            [9] => 
        )

    [1] => Array
        (
            [0] => la 11.6.
            [1] => Varna
                Bulgaria
            [2] => Helsinki
            [3] => Pelkät lennot
            [4] => 150,-
            [5] => 
            [6] => -
            [7] => 
            [8] => -
            [9] => 
        )

...
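
The output above still contains line breaks and indentation inside some cells (for example "Varna" followed by "Bulgaria"). One possible refinement, shown only as a sketch, is to collapse that inner whitespace while collecting the cells. The function name `extractRows` and the sample markup are hypothetical; only the table id `tblLmsList` comes from the code above, and an inline string stands in for the real page so the example runs without network access.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = <<<HTML
<html><body>
<table id="tblLmsList">
  <tr><th>Date</th><th>Destination</th><th>Price</th></tr>
  <tr><td>la 11.6.</td><td>Varna
        Bulgaria</td><td>175,-</td></tr>
  <tr><td>la 11.6.</td><td>Varna
        Bulgaria</td><td>150,-</td></tr>
</table>
</body></html>
HTML;

// Returns one array of cell texts per table row that has td cells.
// Assumes $tableId contains no double quotes, since it is placed
// directly into the XPath expression.
function extractRows(string $html, string $tableId): array
{
    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];
    foreach ($xpath->query('//*[@id="' . $tableId . '"]//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('td', $tr) as $td) {
            // Collapse runs of whitespace (including newlines) into one space.
            $cells[] = trim(preg_replace('/\s+/', ' ', $td->nodeValue));
        }
        // Rows with only th cells produce no entries and are skipped.
        if (count($cells) > 0) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

$rows = extractRows($html, 'tblLmsList');
print_r($rows);
```

For the real page, the string passed to `extractRows` would be the result of `file_get_contents($url)`, after checking that the fetch succeeded.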

Don't use error suppression. Use [`libxml_use_internal_errors`](http://nl2.php.net/manual/en/function.libxml-use-internal-errors.php) and [`libxml_clear_errors`](http://nl2.php.net/manual/en/function.libxml-clear-errors.php) instead. — Gordon, Jun 10 '11 at 09:52
That works!! Thank you so much. You saved me! I'll start learning more about DOMDocument. It seems to work in this case. — Gigg, Jun 10 '11 at 10:04

score 1 · Answer 2 · answered Jun 10 '11 at 09:44

Try having a look at http://simplehtmldom.sourceforge.net/

Nick Fortescue

Besides hardly being an answer, because it doesn't show the OP how to achieve his goal, SimpleHTMLDom is a poor choice for a parser. It's slow, has a crappy codebase and is not based on libxml. See my link below the question for better alternatives to SimpleHtmlDom. – Gordon Jun 10 '11 at 09:48
