How can I parse a very simple Table using PHP

Question

Good day dear community!

I need to build a function that parses the content of a very simple table (some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky. The target I want to parse has some invalid markup:

The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190

Well, I tried it with this:

<?php
require_once('config.php'); // call config.php for db connection
$filename = "url.txt"; // text file containing the URLs, one per line
$each_line = file($filename);
foreach($each_line as $line_num => $line)
{
    $line = trim($line);
    $content = file_get_contents($line);
    //echo ($content)."<br>";
    $pattern = '/<td>(.*?)<\/td>/si';
    preg_match_all($pattern,$content,$matches);

    foreach ($matches[1] as $match) {
        $match = strip_tags($match);
        $match = trim($match);
        //var_dump($match);
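        $sql = mysqli_query($conn, "insert into tablename(contents) values ('" . mysqli_real_escape_string($conn, $match) . "')"); // $conn: assumed to be the mysqli link created in config.php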
        //echo $match;
    }
}
?>

Well, see the regex in lines 7-11: it does not match!

Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not do what it is meant to do, which is to return the contents of the table.

Can anybody help me get a better regex, or a better way to parse this site? Any and all help will be greatly appreciated.

regards zero

Do the td's have attributes or other stuff? What about an XML parser? — , Dec 19 '10 at 11:08
Have a look at http://simplehtmldom.sourceforge.net/ (for your html parsing needs) — Andreas, Dec 19 '10 at 11:10
Hello Dan, right, I have to rework the parser. But doesn't the page have some invalid markup? Time Machine & Dan, what would you suggest here? I need some starting points. Many thanks for any help here. I need to rebuild this script — zero, Dec 19 '10 at 11:12
Hello all again. I tried to work with simplehtmldom, but I guess that it cannot handle invalid HTML? BTW @Dan: I have some <td>s (see Schulnummer), but this code does not handle all of the parsing well — zero, Dec 19 '10 at 11:16
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Dec 19 '10 at 12:05
Agreed. I will not write code for you if you never accept answers. — Dan Grossman, Dec 20 '10 at 02:40

score 0 · Answer 1 · answered Dec 19 '10 at 11:20

You could tear the table apart using preg_split('/<td width="73%"> /', $str, -1); (note: I did not bother escaping characters)

You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the .

This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
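The split-and-cut idea above can be sketched roughly as follows. This is only an illustration under assumptions: the sample markup and its values are made up, the function name `extract_values` is hypothetical, and the cut point is assumed to be the closing `</td>` tag, since the text above does not say which delimiter to cut at. `preg_quote` takes care of the escaping that was skipped.

```php
<?php
// Hypothetical sample markup; the real page's structure may differ.
$str = '<table>'
     . '<tr><td width="27%">Schulnummer</td><td width="73%">194190</td></tr>'
     . '<tr><td width="27%">Schulform</td><td width="73%">Gymnasium</td></tr>'
     . '</table>';

/**
 * Split the HTML on the opening tag of the value cells, then cut each
 * piece at the first closing </td> (an assumed delimiter).
 */
function extract_values(string $html): array
{
    // preg_quote escapes regex metacharacters in the literal tag.
    $pattern = '/' . preg_quote('<td width="73%">', '/') . '/i';
    $parts = preg_split($pattern, $html);

    // The first entry is everything before the first value cell; drop it.
    array_shift($parts);

    $values = [];
    foreach ($parts as $part) {
        $end = stripos($part, '</td>');
        if ($end === false) {
            continue; // no closing tag found: skip rather than guess
        }
        // Keep only the text before </td>, without nested tags or padding.
        $values[] = trim(strip_tags(substr($part, 0, $end)));
    }
    return $values;
}

print_r(extract_values($str));
```

This depends on the exact attribute string `width="73%"`, so any change in the page's markup breaks it. That is the fine-tuning mentioned above.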

answered Dec 19 '10 at 11:20

DamnYankee

Good day Damn Yankee, many thanks for sharing your approach! By dropping the first entry you mean that I should drop my approach. I substitute it by using preg_split('/ /', $str, -1); Afterwards I need to use stripos and substr to cut away all the stuff that I do not need. I will try it out later today! Many thanks for sharing your ideas here! – zero Dec 19 '10 at 11:25

score 0 · Answer 2 · edited May 23 '17 at 11:47

Regex does not always give reliable results on HTML. Using an HTML parser is a better idea. There are many HTML parsers, as described in Gordon's answer (linked in the comments above).

I have used Simple HTML DOM Parser in the past and it worked for me.

For example:

// Load the library (assumes simple_html_dom.php is in the include path)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all <td> elements in a <table> with class="hello"
$es = $html->find('table.hello td');

// Find all <td> elements with attribute align=center inside <table> elements
$es = $html->find('table td[align=center]');
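Since the question worries about invalid markup, here is a minimal sketch of the same parser-based idea using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of Simple HTML DOM. It is a different library from the one shown above. The sample markup, its values, and the function name `table_to_pairs` are made up for illustration, and the two-cells-per-row layout is an assumption about the target page.

```php
<?php
// Hypothetical, deliberately sloppy markup: the first row has unclosed <td> and <tr>.
$html = '<table>'
      . '<tr><td width="27%">Schulnummer<td width="73%">194190'
      . '<tr><td width="27%">Schulform</td><td width="73%">Gymnasium</td></tr>'
      . '</table>';

/**
 * Read each table row as a label => value pair.
 * Assumes the first cell holds the label and the second cell the value.
 */
function table_to_pairs(string $html): array
{
    // Collect libxml's parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];
    foreach ($xpath->query('//tr') as $row) {
        $cells = $xpath->query('./td', $row);
        if ($cells->length < 2) {
            continue; // not a label/value row
        }
        $label = trim($cells->item(0)->textContent);
        $value = trim($cells->item(1)->textContent);
        $pairs[$label] = $value;
    }
    return $pairs;
}

print_r(table_to_pairs($html));
```

For a real page, the string passed to `table_to_pairs` would come from something like `file_get_contents($url)`. The XPath expressions would probably need adjusting to the actual table.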

edited May 23 '17 at 11:47

Community

answered Dec 19 '10 at 12:33

Naveed

Hi Naveed, many thanks for the hints. I will try it out later this weekend – zero Dec 19 '10 at 17:33
