Good evening dear community,

I need some help with preg_match. I want to optimize code that already runs very well: I want to get only the results, not the overhead of HTML tags in the result. That means I have to tailor the regex a bit. How can I improve the (already very nice) code?

<?php

$content = file_get_contents("< - URL - >");

var_dump($content);

$pattern = '/<td>(.*?)<\/td>/si';
preg_match_all($pattern,$content,$matches);

foreach ($matches[1] as $match) {
    $match = strip_tags($match);
    $match = trim($match);
    var_dump($match);
}

?>

See here the url: link text

Hmm - I need to tailor the regex a bit... Can anybody help me?

Any idea or tip will be greatly appreciated. Regards, zero

zero
  • Could you explain what you're looking for? Same output but faster processing? Different output? – nickf Dec 09 '10 at 22:19
  • Yeah, what's your question about exactly? – Pekka Dec 09 '10 at 22:23
  • Well, the HTML is invalid. I need a good regex or another approach that gives me a starting point. The regex I have does not fit 100%. Hmm, in Perl there is a way to strip the table tags, isn't there? – zero Dec 09 '10 at 22:42

1 Answer

It appears that you are trying to scrape data from HTML pages. If this is the case, then you really should not use regular expressions to extract information. Take a look instead at the DOMDocument class.

Note that DOMDocument's XML loading methods require well-formed input, so a "tidying" process is often needed to prepare the HTML for being parsed as XML. One convenient way to do this is the "tidy" extension. See "Tidying up your HTML with PHP 5" for an introduction to its use.

EDIT: How can I scrape a website with invalid HTML
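
For what it's worth, here is a minimal sketch of the DOMDocument approach applied to the table cells from the question. It assumes the page can be read with DOMDocument::loadHTML(), which is more forgiving of broken markup than the XML loaders. The sample markup and the helper name extractCells are made up for illustration; in the real script $content would come from file_get_contents() as in the question.

```php
<?php
// Hypothetical sample standing in for the fetched page.
// Note the attribute on the first <td> and the missing </td> in the first row.
$content = '<table>'
    . '<tr><td class="x"> First <b>cell</b> </td><td>Second</tr>'
    . '<tr><td>Third</td></tr>'
    . '</table>';

/**
 * Return the trimmed text of every <td> element in an HTML string.
 * extractCells is an illustrative helper name, not a PHP built-in.
 */
function extractCells(string $html): array
{
    // Collect parser warnings about invalid markup internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $cells = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        // textContent holds the text of the cell without any nested tags,
        // so strip_tags() is not needed here.
        $cells[] = trim($td->textContent);
    }
    return $cells;
}

$cells = extractCells($content);
print_r($cells);
```

Compared with the pattern '/<td>(.*?)<\/td>/si', this also picks up cells whose opening tag carries attributes, and it does not depend on every cell having a closing tag.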

Community
Daniel Trebbien
  • You could use getElementsByTagName to get an HTML element with DOMDocument in PHP. +1 to this suggestion. More info here: http://www.php.net/manual/en/domdocument.getelementsbytagname.php – Tek Dec 09 '10 at 22:30
  • Hello Daniel, hello Tek, many thanks for your answers. You suggest using getElementsByTagName; I will have a closer look at this! The website has invalid code, unfortunately. That said, I think it is a bad approach to do it with regex. – zero Dec 09 '10 at 22:39