-1

I have an HTML document in a PHP variable, $content. I can echo it, but I only need the <a ...> tags with class="pret". From each of those I need the code in the href attribute (e.g. d3852) and the number between <a> and </a> (e.g. 2352.2345).

I have tried several examples from the web, but I get either empty arrays or PHP errors.

A regex example that gives me an empty array (the <a> tag is inside a table):

$pattern = "#<table\s.*?>.*?<a\s.*?class=[\"']pret[\"'].*?>(.*?)</a>.*?</table>#i";
preg_match_all($pattern, $content, $results);
print_r($results[1]);

Another example that gives just an error

$a=$content->getElementsByTagName(a);

Reason for the various errors: invalid HTML and non-UTF-8 characters.

Next I did this on another website, matched the contents in a single SQL table, and the result is a copied website with updated data from my country. No longer will I search the www for matching single results.

Joita Dan
  • Why don't you use a DOM parser instead of one complex regexp? – Adidi Apr 20 '13 at 18:30
  • Your (any) regexp is likely to break in future. You should use [PHP DOM parsing](http://php.net/manual/en/book.dom.php) to get it done. – Ejaz Apr 20 '13 at 18:34
  • i did try with dom but resulted in errors, so i think the empty array i got with regex is closer to a result – Joita Dan Apr 20 '13 at 18:38
  • And may we ask what errors did you get? – Vyktor Apr 20 '13 at 18:46
  • 500 Internal Server Error – Joita Dan Apr 20 '13 at 18:55
  • @JoitaDan: For all these cases, please use the search. Not only are regexes for HTML explained at length, but also why they do not work well with HTML and how you get DOMDocument running. On top of that, how to solve a 500 Internal Server Error is explained as well. Let me know if you need *concrete* help with any of these three. – hakre Apr 21 '13 at 00:18
  • You can find [the PHP Error Reference here](http://stackoverflow.com/q/12769982), hopefully it is useful for you. You should also see how you can enable error logging with PHP and track the error log. – hakre Apr 21 '13 at 00:19

2 Answers

2

Assuming you're trying to parse a valid (or at least valid enough) HTML document, you should use DOM for this:

// Simple example from php manual from comments
$xml = new DOMDocument(); 
$xml->loadHTMLFile($url); 
$links = array(); 

foreach($xml->getElementsByTagName('a') as $link) { 
    $links[] = array('url' => $link->getAttribute('href'),
                     'text' => $link->nodeValue); 
} 

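Note the use of loadHTMLFile() rather than load(): the HTML loaders are more tolerant of malformed markup. Since you already have the document in $content, loadHTML($content) does the same on a string. You may also set the DOMDocument::$recover property (as suggested in a comment by hakre) so the parser tries to recover from errors.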
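As a minimal sketch of the string-based variant, assuming the parse warnings are what gets in the way: the $content value below is a made-up, deliberately malformed sample (not the asker's real page), and libxml_use_internal_errors() is used so the warnings are collected quietly instead of being printed.

```php
<?php
// Hypothetical stand-in for the asker's $content (note the stray </head>).
$content = "<html><head></head></head><body><table><tr><td>"
         . "<a href='/word_word_34670_word_number.htm' class=\"pret\">"
         . "<span>3340.3570 word</span></a>"
         . "</td></tr></table></body>";

// Collect libxml parse warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($content);

// Drop the collected warnings and restore the previous error handling.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$links = array();
foreach ($doc->getElementsByTagName('a') as $link) {
    $links[] = array('url'  => $link->getAttribute('href'),
                     'text' => $link->nodeValue);
}
print_r($links);
```

If the page is not UTF-8, the text may still come out garbled; converting $content to UTF-8 first is one option, but the right source encoding depends on the site.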

Or you could use XPath to select only the links with class "pret":

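$xpath = new DOMXPath($xml); // $xml is the DOMDocument loaded above
$elements = $xpath->query("//a[@class='pret']");

// query() returns false (not null) if the expression is malformed
if ($elements !== false) {
    foreach ($elements as $element) {
        $links[] = array('url' => $element->getAttribute('href'),
                         'text' => $element->nodeValue); 
    }
}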
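One caveat with @class='pret' is that it only matches when the attribute is exactly "pret". If a link might carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and test for the padded token. The sketch below uses made-up markup and hrefs purely for illustration.

```php
<?php
// Hypothetical markup: the second link has two classes, the third a look-alike class.
$html = '<table><tr><td>'
      . '<a href="/x_d3852_y.htm" class="pret">2352.2345</a>'
      . '<a href="/x_d3853_y.htm" class="pret nou">17.5</a>'
      . '<a href="/other.htm" class="pretty">skip me</a>'
      . '</td></tr></table>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Matches "pret" as a whole class token, wherever it sits in the class list.
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' pret ')]";

$links = array();
foreach ($xpath->query($query) as $a) {
    $links[] = array('url'  => $a->getAttribute('href'),
                     'text' => trim($a->nodeValue));
}
print_r($links);
```

Because the query starts with //a, it finds the links no matter how deeply the tables are nested, so there is no need to spell out the full /html/body/table/... path.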

And for the case of invalid HTML, you may use a regexp like this:

$a1 = '\s*[^\'"=<>]+\s*=\s*"[^"]*"'; # Attribute with " - space tolerant
$a2 = "\s*[^'\"=<>]+\s*=\s*'[^']*'"; # Attribute with ' - space tolerant
$a3 = '\s*[^\'"=<>]+\s*=\s*[\w\d]*'; # Unquoted values - space tolerant
# [^'"=<>]* # Junk - I'm not inserting this to regexp but you may have to

$a = "(?:$a1|$a2|$a3)*"; # Any number of attributes
$class = 'class=([\'"])pret\\1'; # Using ?: carefully is crucial for \\1 to work
                                 # otherwise you can use ["']
$reg = "#<a{$a}\s*{$class}{$a}\s*>(.*?)</a\s*>#is"; # Delimiters + flags added

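Then just run preg_match_all with it; the link text ends up in capture group 2, because group 1 holds the quote character. All the regexps are written off the top of my head, so you may have to debug them.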

Vyktor
  • tried something like this, same error message, so it may be the document is not valid HTML....... pain – Joita Dan Apr 20 '13 at 18:50
  • after validating with w3, i got Sorry! This document cannot be checked. a non utf 8 char... – Joita Dan Apr 20 '13 at 18:56
  • so maybe removing non utf 8 will help? – Joita Dan Apr 20 '13 at 18:58
  • if there are no non utf 8 chars, the html 5 validator says: Stray end tag head.; A body start tag seen but an element of the same type was already open.; Cannot recover after last error. Any further errors will be ignored. – Joita Dan Apr 20 '13 at 19:03
  • How would I get just my table from all the document, to avoid errors? – Joita Dan Apr 20 '13 at 19:06
  • Why aren't you matching just for `` ? – Vyktor Apr 20 '13 at 20:11
  • it doesn't work, don't ask me why but I get more tags matching a and less lines matching table, still I am trying to get inside that ugly /html/body/table/tr/td/table/tr/td/table/tr/td/a – Joita Dan Apr 20 '13 at 20:27
  • http://www.php.net/class.domdocument.php#domdocument.props.recover - also, errors can be suppressed; just search for the existing Q&A here on this website. We have these cases all covered. – hakre Apr 21 '13 at 00:20
0

I got the links like this:

preg_match_all('/<a[^>]*class="pret">(.*?)<\\/a>/si', $content, $links);
print_r($links[0]);

and the result is

Array(
[0] => <a href='/word_word_34670_word_number.htm' class="pret"><span>3340.3570 word</span></a>..........)

So now I need to get the first number inside the href and the number inside the <span>.
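One way to do that last step, sketched under the assumption that every matched tag looks like the sample above (a quoted href containing a run of digits, and a <span> that starts with a decimal number). The $matches array and the 'code' and 'price' keys are illustrative names, not anything from the original page.

```php
<?php
// One entry copied from the output above; in practice this would be $links[0].
$matches = array(
    "<a href='/word_word_34670_word_number.htm' class=\"pret\"><span>3340.3570 word</span></a>",
);

$rows = array();
foreach ($matches as $tag) {
    // First run of digits inside the href value (single- or double-quoted).
    $hasCode  = preg_match('/href=["\'][^"\']*?(\d+)/i', $tag, $code);
    // Decimal number right after the opening <span>.
    $hasPrice = preg_match('/<span[^>]*>\s*(\d+(?:\.\d+)?)/i', $tag, $price);

    // Keep only tags where both pieces were found.
    if ($hasCode && $hasPrice) {
        $rows[] = array('code' => $code[1], 'price' => $price[1]);
    }
}
print_r($rows);
```

For the sample tag this yields code 34670 and price 3340.3570. If the code can start with a letter (like d3852), the first pattern would need adjusting, for example to ([a-z]?\d+).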

Joita Dan