
I am doing some HTML parsing in PHP, and this is the code I have right now:

function get_tag($htmlelement, $attr, $value, $xml, $arr) {
    $attr = preg_quote($attr);
    $value = preg_quote($value);
    if ($attr != '' && $value != '') {
        $tag_regex = '/<'.$htmlelement.'[^>]*'.$attr.'="'.$value.'">(.*?)<\\/'.$htmlelement.'>/si';
        preg_match($tag_regex, $xml, $matches);
    } else {
        $tag_regex = '/'.$htmlelement.'[^>]*"(.*?)\/'.$htmlelement.'/i';
        preg_match_all($tag_regex, $xml, $matches);
    }
    if ($arr) {
        return $matches;
    } else {
        return $matches[1];
    }
}
$htmlcontent = file_get_contents("doc.html");
$extract = get_tag('tbody','id', 'open', $htmlcontent,false);

$trows = get_tag('tr','', '', $htmlcontent,false);

The rows that have to be parsed (the content of $extract) can be viewed here: http://pastebin.com/ydiAdiuC.

Basically, I am reading the HTML content and extracting the tbody tag from it. Now I want to take each tr, and the td values inside it, from that tbody and use them in my page. Any idea how to do this? I think I am not using preg_match_all correctly.

Joe
  • 1
    Relevant answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – arnemart Jun 16 '11 at 12:53

1 Answer


Use PHP's DOM parser for this, not regular expressions.

A quick approach:

  • Load in the HTML
  • Get the tbody tag.
  • Get the tr tags within.
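
The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `extract_rows` is a hypothetical helper name, the `id="open"` value is taken from the question, and the inline sample markup is made up for illustration. Because the asker's markup may not be well formed, the sketch asks libxml to collect parse warnings internally instead of printing them.

```php
<?php
// Hypothetical helper (not from the original post): returns the text of each
// td, grouped by tr, for the tbody whose id attribute equals $tbodyId.
// Assumes $html is non-empty and $tbodyId contains no double quotes.
function extract_rows(string $html, string $tbodyId): array
{
    // Collect libxml parse warnings internally rather than emitting them,
    // since the input markup may be malformed.
    $previous = libxml_use_internal_errors(true);

    // Step 1: load the HTML.
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    // Steps 2 and 3: find the tbody by id, then every tr inside it.
    // Note: '//tr' also matches rows of any nested tables.
    foreach ($xpath->query('//tbody[@id="' . $tbodyId . '"]//tr') as $tr) {
        $cells = [];
        // Only the td elements that are direct children of this tr.
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }

    return $rows;
}

// Usage with a small made-up sample; in the question this would be
// file_get_contents("doc.html").
$sample = '<table><tbody id="open">'
    . '<tr><td>a</td><td>b</td></tr>'
    . '<tr><td>c</td><td>d</td></tr>'
    . '</tbody></table>';

foreach (extract_rows($sample, 'open') as $cells) {
    echo implode(' | ', $cells), "\n";
}
```

Each element of the returned array is one row, so the td values can be used directly in the page without any further pattern matching.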
Jason McCreary
  • Could you give me a short code example? The HTML tags aren't closed properly, and I have no control over the HTML content. – Joe Jun 16 '11 at 13:02
  • 1
    @joza: run [Tidy](http://php.net/manual/en/book.tidy.php) over it first in case it's totally broken. Otherwise tell DomDocument to ignore errors. – hakre Jun 16 '11 at 13:06
  • @joza, invalid markup will be an issue. See **hakre**'s comment for a way to get around this. Invalid markup would be a nightmare for regular expressions and one of the main reasons they have trouble parsing HTML. – Jason McCreary Jun 16 '11 at 13:39