Good day dear community!

I need to build a function that parses the content of a very simple table (with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky. The target I want to parse has some invalid markup:

The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190

Well, I tried it with this one:

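<?php
require_once('config.php'); // DB connection; assumed here to expose a mysqli handle as $db
$filename = "url.txt"; // text file with one URL per line
$each_line = file($filename, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
// Prepared statement, so quotes in the scraped text cannot break the query
$stmt = mysqli_prepare($db, "insert into tablename(contents) values (?)");
foreach ($each_line as $line_num => $line)
{
    $line = trim($line);
    $content = file_get_contents($line);
    if ($content === false) {
        continue; // skip URLs that could not be fetched
    }
    //echo ($content)."<br>";
    // Matches only bare <td> tags without any attributes
    $pattern = '/<td>(.*?)<\/td>/si';
    preg_match_all($pattern, $content, $matches);

    foreach ($matches[1] as $match) {
        $match = strip_tags($match);
        $match = trim($match);
        //var_dump($match);
        mysqli_stmt_bind_param($stmt, 's', $match);
        mysqli_stmt_execute($stmt);
        //echo $match;
    }
}
?>

Well, see the regex (the $pattern passed to preg_match_all): it does not match!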

Conclusion: I have to rework the parser part of this script. I need to parse in a different way, since the parser code does not match what I am aiming for, which is to get back the contents of the table.

Can anybody help me get a better regex, or a better way to parse this site? Any and all help will be greatly appreciated.

regards zero

zero
  • Do the td's have attributes or other stuff? What about an XML parser? –  Dec 19 '10 at 11:08
  • `<td>` appears nowhere in the webpage you're parsing. – Dan Grossman Dec 19 '10 at 11:08
  • Have a look at http://simplehtmldom.sourceforge.net/ (for your html parsing needs) – Andreas Dec 19 '10 at 11:10
  • Hello Dan, right, I have to rework the parser. But doesn't the page have some invalid markup? Time Machine & Dan, what would you suggest here? I need some starting points. Many thanks for any and all help here. I need to rebuild this script. – zero Dec 19 '10 at 11:12
  • Hello all again. I tried to work with simplehtmldom, but I guess that it cannot handle invalid HTML? BTW @Dan: I do have some <td>s (see Schulnummer), but this code does not handle all of the parsing well. – zero Dec 19 '10 at 11:16
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Dec 19 '10 at 12:05
  • Agreed. I will not write code for you if you never accept answers. – Dan Grossman Dec 20 '10 at 02:40

2 Answers

You could tear the table apart using preg_split('/<td width="73%">&nbsp;/', $str, -1); (note: I did not bother escaping characters).

You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the closing </td>.

This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what would be my approach.
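The split-and-trim idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name extractValueCells and the sample markup are made up here, and the real page may well differ (for example in the width attributes or in how cells are closed).

```php
<?php
// Sketch of the split-and-trim approach; the sample markup below is invented
// for illustration and only mimics the value cells mentioned in this answer.
function extractValueCells(string $html): array
{
    // Split on the opening tag of each value cell. preg_quote() escapes
    // the literal so it is safe to embed in the pattern.
    $marker = preg_quote('<td width="73%">&nbsp;', '/');
    $parts  = preg_split('/' . $marker . '/i', $html, -1);

    // Everything before the first value cell is not a value: drop it.
    array_shift($parts);

    $values = [];
    foreach ($parts as $part) {
        // Cut each chunk at the first closing </td>, if there is one.
        $end = stripos($part, '</td>');
        if ($end !== false) {
            $part = substr($part, 0, $end);
        }
        $values[] = trim(strip_tags($part));
    }
    return $values;
}

// Hypothetical two-row table: label cell on the left, value cell on the right.
$sample = '<table>'
    . '<tr><td width="27%">Schulnummer</td><td width="73%">&nbsp;123456</td></tr>'
    . '<tr><td width="27%">Ort</td><td width="73%">&nbsp;<b>Musterstadt</b></td></tr>'
    . '</table>';

print_r(extractValueCells($sample));
```

Because the split relies on one exact opening tag, any change in the page's attributes would break it, which is why the fine-tuning mentioned above matters.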

DamnYankee
  • Good day DamnYankee, many thanks for sharing your approach! By dropping the first entry, do you mean that I should drop my approach? I would substitute it with preg_split('/<td width="73%">&nbsp;/', $str, -1); and afterwards use stripos and substr to cut away all the stuff that I do not need. I will try it out later today. Many thanks for sharing your ideas here! – zero Dec 19 '10 at 11:25

A regex does not always give a perfect result. Using an HTML parser is a good idea; there are many of them, as described in Gordon's answer.

I have used Simple HTML DOM Parser in the past and it worked for me.

For example:

// Load the library (simple_html_dom.php ships with Simple HTML DOM)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all <td> in <table> with class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');
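
Since the question worries about invalid markup, here is a rough sketch of the same idea using PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM. This swaps in a different parser than the one used above; the function name tableCellTexts and the sample markup, which deliberately leaves td and tr tags unclosed, are assumptions made for illustration only.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension instead of Simple HTML DOM.
// The sample markup is invented and deliberately leaves some tags unclosed.
function tableCellTexts(string $html): array
{
    // Collect libxml's complaints about invalid markup instead of
    // emitting them as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $cells = [];
    foreach ($xpath->query('//table//td') as $td) {
        // Turn non-breaking spaces into plain spaces, then trim.
        $text = trim(str_replace("\u{00A0}", ' ', $td->textContent));
        if ($text !== '') {
            $cells[] = $text;
        }
    }
    return $cells;
}

// Hypothetical table with unclosed <td> and <tr> tags.
$sample = '<table><tr><td width="27%">Schulnummer<td width="73%">&nbsp;123456'
        . '<tr><td width="27%">Ort<td width="73%">&nbsp;Musterstadt</table>';

print_r(tableCellTexts($sample));
```

The XPath expression plays the role of the selector strings in the find() calls; on a real page it would likely need narrowing, for instance with an attribute test.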
Naveed