Regex issue with multiple results

Question

I am doing some PHP HTML parsing, and this is the code I have right now:

function get_tag($htmlelement,$attr, $value, $xml ,$arr) {
    $attr = preg_quote($attr);
    $value = preg_quote($value);
    if($attr!='' && $value!='')
    {
        $tag_regex = '/<'.$htmlelement.'[^>]*'.$attr.'="'.$value.'">(.*?)<\\/'.$htmlelement.'>/si';
        preg_match($tag_regex,$xml,$matches);
    }
    else
    {
        $tag_regex = '/'.$htmlelement.'[^>]*"(.*?)\/'.$htmlelement.'/i';
        preg_match_all($tag_regex,$xml,$matches);
    }
    if($arr)
        return $matches;
    else
        return $matches[1];
}
$htmlcontent = file_get_contents("doc.html");
$extract = get_tag('tbody','id', 'open', $htmlcontent,false);

$trows = get_tag('tr','', '', $htmlcontent,false);

The rows that have to be parsed (the content of $extract) can be viewed here: http://pastebin.com/ydiAdiuC.

Basically, I am reading the HTML content and getting the tbody tag from it. Now I want to take each tr and td value in the tbody and use it in my page. Any idea how to do this? I think I am not using preg_match_all the right way.

Relevant answer: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — arnemart, Jun 16 '11 at 12:53

score 7 · Accepted Answer · answered Jun 16 '11 at 12:51

Use PHP's DOM Parsers for this. Not Regular Expressions.

A quick approach:

1. Load in the HTML.
2. Get the tbody tag.
3. Get the tr tags within.
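
The steps above can be sketched roughly as follows. This is a minimal illustration, not a drop-in solution: the helper name `extract_rows()` and the sample markup are hypothetical, and in the question's setup the HTML would come from `file_get_contents("doc.html")` instead. It assumes the `tbody` can be located by its `id` attribute and uses `libxml_use_internal_errors()` so that badly closed tags do not flood the output with warnings.

```php
<?php
// Sketch only: extract_rows() and the sample markup below are illustrative
// assumptions, not part of the original question or answer.

/**
 * Return the trimmed cell text of every <tr> inside the <tbody> with the given id.
 *
 * @return string[][] one array of cell strings per row
 */
function extract_rows(string $html, string $tbodyId): array
{
    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    // Step 1: load in the HTML.
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Steps 2 and 3: find the tbody by id, then the tr elements inside it.
    // Note: $tbodyId is concatenated into the XPath expression, so it is
    // assumed to be a trusted value without quote characters.
    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//tbody[@id="' . $tbodyId . '"]//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Hypothetical sample input; the last td and tr are deliberately left unclosed.
$html = '<table><tbody id="open">'
      . '<tr><td>A1</td><td>B1</td></tr>'
      . '<tr><td>A2</td><td>B2'
      . '</tbody></table>';

$rows = extract_rows($html, 'open');
print_r($rows);
```

Each entry of `$rows` holds the text of one row's `td` cells, so the values can be looped over and used in the page without any regular expressions.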

Jason McCreary

Could you give me a short code example? The HTML tags aren't closed properly and I have no control over the HTML content. – Joe Jun 16 '11 at 13:02
@joza: run [Tidy](http://php.net/manual/en/book.tidy.php) over it first in case it's totally broken. Otherwise tell DomDocument to ignore errors. – hakre Jun 16 '11 at 13:06
@joza, invalid markup will be an issue. See **hakre**'s comment for a way to get around this. Invalid markup would be a nightmare for regular expressions and one of the main reasons they have trouble parsing HTML. – Jason McCreary Jun 16 '11 at 13:39