1

This expression only captures the text between the tags (the part between > and <) when it is purely numeric. I want to capture it in every case.

function GetProducts($file){
    $regex = "|class=\"producto\"[^>]+>([0-9]*)</[^>]+>|U";
    if(!is_file($file)) return false;
    preg_match_all($regex,file_get_contents($file), $result);
    foreach($result[1] as $key =>$value) $result[$key] = (int) $value;
    return $result;
}

This is my HTML code:

<a class="producto" href="ver.asp?id=4013">A86028</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1027C</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">5611 4020</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">396-4185</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">834006-5-7</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">5601GR 4325GR</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1458-54-63-55</a></span>
<!-- /a --></td></tr>

My desired output is:

Array ([1] => 1027 [2] => 5611 [3] => 396 [4] => 834006 [5] => 5601 [6] => 2182 [7] => 1458)
hwnd

3 Answers

2

This might work, but as people say, parsing HTML with regex is problematic.

 # class="producto"[^>]+>([^<]*)</[^>]+>

 class="producto" [^>]+ >
 ( [^<]* )
 </ [^>]+ >
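
As a rough sketch of how the pattern above could be plugged into PHP: the snippet below runs it through preg_match_all and then keeps only the leading digits of each captured value, which is what the desired output in the question seems to call for. The $html sample and the helper name leadingNumbers are made up for illustration, and values that do not start with a digit (such as A86028) are simply skipped.

```php
<?php
// Hypothetical sample input, shortened from the HTML in the question.
$html = <<<'HTML'
<a class="producto" href="ver.asp?id=4013">A86028</a></span>
<a class="producto" href="ver.asp?id=4014">1027C</a></span>
<a class="producto" href="ver.asp?id=4014">5611 4020</a></span>
<a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>
HTML;

// Hypothetical helper: capture the link text, then keep its leading digits.
function leadingNumbers($html) {
    $regex = '~class="producto"[^>]+>([^<]*)</[^>]+>~';
    preg_match_all($regex, $html, $matches);

    $numbers = array();
    foreach ($matches[1] as $text) {
        // Skip values that do not begin with a digit.
        if (preg_match('/^\d+/', $text, $m)) {
            $numbers[] = (int) $m[0];
        }
    }
    return $numbers;
}

print_r(leadingNumbers($html));
```

Splitting the work into two steps (grab the link text, then trim it to the leading number) keeps each pattern short, at the cost of a second preg_match call per link.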
  • To quote the bountied answer of the very post that so berates HTML regex parsing, *While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a* **limited, known set of HTML**. And this is the case here. – LSerni Sep 11 '14 at 20:50
  • Yeah, I could throw down a 15k regex to parse HTML and it's still problematic, especially with entities and substitutions. I'd argue this applies even to a known set of HTML. –  Sep 11 '14 at 21:17
1

You've asked for a pure regular expression here, but it's not the right tool for parsing HTML.

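// Reducer callback: append the leading digits of $str to $m, if any.
function _matcher ($m, $str) {
  if (preg_match('/^\d+/', $str, $matches))
    $m[] = $matches[0];
  return $m;
}

// $html is assumed to hold the markup shown in the question.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the text of every <a class="producto"> element.
$vals = array();
foreach ($xpath->query('//a[@class="producto"]') as $link) {
   $vals[] = $link->nodeValue;
}

print_r(array_reduce($vals, '_matcher', array()));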

Output:

Array
(
    [0] => 1027
    [1] => 5611
    [2] => 396
    [3] => 834006
    [4] => 5601
    [5] => 2182
    [6] => 1458
)
hwnd
0

You can use a regex like this:

([\w\s()-]+)</

The idea is to capture alphanumeric characters, whitespace, dashes, and parentheses that appear directly before the closing tag (</).
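
A minimal sketch of how this pattern might be used from PHP, assuming the markup is already in a string. The $html sample is hypothetical, and the hyphen is written last in the character class so it cannot be read as a range. Note that the pattern is not tied to class="producto", so on a fuller page it would also pick up text in front of any other closing tag.

```php
<?php
// Hypothetical sample input, shortened from the HTML in the question.
$html = '<a class="producto" href="ver.asp?id=4013">A86028</a></span>' . "\n"
      . '<a class="producto" href="ver.asp?id=4014">396-4185</a></span>' . "\n"
      . '<a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>';

// Capture runs of word characters, whitespace, parentheses and dashes
// that sit directly in front of a closing tag.
preg_match_all('~([\w\s()-]+)</~', $html, $matches);

// Trim stray whitespace around each captured value.
$products = array_map('trim', $matches[1]);

print_r($products);
```

Unlike the other answers, this returns the full product codes rather than only their leading digits, so a further step would be needed to reach the output requested in the question.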

Federico Piazza