I have the following regex in a PHP script:

$total_matches = preg_match_all('{

        <a\shref="
        (?<link>[^"]+)
        "(?:(?!src=).)+src="
        (?<image>[^"]+)
        (?:(?!designer-name">).)+designer-name">
        (?<brand>[^<]+)
        (?:(?!title=).)+title="
        (?<title>((?!">).)+)
        (?:(?!"price">).)+"price">\$
        (?<price>[\d.,]+)

}xsi',$output,$all_matches,PREG_SET_ORDER);

This regex seems to work fine when parsing the following, either via PHP or using the parser at regexr.com (with the same options set: case insensitive, extended, treat line breaks as whitespace):

<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
  "DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
  <img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_  image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg"
   width="230" height="260" 
   alt=   "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" 
   title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> 
 <img class="image2nd" src=  "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg"
width="230" height="260" alt=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class=
"availability"><strong>available sizes</strong><br /></span></a>

<div style="margin-left: 2em" class="available-sizes">
<h2 class="designer-name">Christian Louboutin</h2>

<div class="product-buttons">
  <div class="product-button">
    NEW ARRIVAL
  </div>

  <div class="clearer"></div>
</div>

<h3 class="product-name"><a href=
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>

<div class="price-box">
  <span class="regular-price" id="product-price-114114"><span class=
  "price">$805.00</span></span>
</div>

It also works fine if I parse several matches in a row. However, when I try parsing the full page these matches come from (I have permission to parse it):

http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12

the regex fails (I actually get a 500 error). I've tried increasing the backtrack and recursion limits using:

ini_set('pcre.backtrack_limit',100000000);
ini_set('pcre.recursion_limit',100000000);

but this does not solve the problem. I am wondering what I am doing wrong that causes the regex to fail in PHP when it seems to be valid and to match the code on the relevant page. Fiddling with it suggests that the negative lookaheads (in conjunction with the page length) are causing problems, but I'm not sure how I screwed them up. I am running PHP 5.2.17.
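
One way the failure could be narrowed down is to ask PCRE what went wrong instead of only raising the limits. The sketch below is an illustration, not part of the original script: `pcre_error_name()` is a hypothetical helper, and the subject and pattern are short stand-ins for the real page and regex. It relies only on `preg_last_error()`, the `PREG_*_ERROR` constants and `PCRE_VERSION`, all of which exist in PHP 5.2. Note that a very large `pcre.recursion_limit` may let PCRE exhaust the process stack and crash PHP outright, which would surface as a 500 error rather than as a reported PCRE error.

```php
<?php
// Hypothetical helper: translate preg_last_error() codes into readable names.
function pcre_error_name($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown PCRE error ' . $code;
}

// Stand-in subject and pattern (assumptions, not the real page or the full regex).
$output  = '<a href="http://example.com/item.html">item</a>';
$pattern = '{<a\shref="(?<link>[^"]+)"}xsi';

$total_matches = preg_match_all($pattern, $output, $all_matches, PREG_SET_ORDER);

echo 'PCRE version: ', PCRE_VERSION, "\n";
if ($total_matches === false) {
    // preg_match_all() returns false when PCRE gives up; the code says why.
    echo 'preg_match_all failed: ', pcre_error_name(preg_last_error()), "\n";
} else {
    echo 'matches: ', $total_matches, "\n";
}
```

If the script dies before printing anything at all, that points toward a crash inside PCRE rather than a limit being reported.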

cwallenpoole
jela
  • And you have permission to use their content? –  Aug 10 '11 at 03:17
  • Also check the `PCRE_VERSION` constant. If it is reasonably outdated, try to install an updated `libpcre`. The `(?:(?!..).)+` assertions are probably pricey. Unless you want to rework the regex or split it up into a preg_replace_callback, consider using an HTML toolkit like phpQuery or QueryPath for extraction (easier, and often not measurably slower). – mario Aug 10 '11 at 03:21
  • @mario my PCRE_VERSION is 8.02 2010-03-19, I'm not sure if that qualifies it as old (it's 4 versions out of date). I think I might have to rework the regex. I'm surprised the lookaheads are expensive, but I think you're probably right. I'll look into phpQuery and QueryPath if I can't rework the regex. – jela Aug 10 '11 at 04:43
  • *(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) – Gordon Aug 10 '11 at 07:09

2 Answers

You have made one of the classic blunders! Don't use regex to parse HTML! It breaks regex! (This is right after "Never get involved in a land war in Asia" and "Never go in against a Sicilian when death is on the line.").

You should be using SimpleXML or DomDocument to parse this:

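$dom = new DOMDocument();
// loadHTMLFile() reads and parses a file or URL; loadHTML() expects the markup itself.
// The @ hides the warnings libxml raises on sloppy real-world HTML.
@$dom->loadHTMLFile( 'http://www.mytheresa.com/us_en/new-arrivals/'.
                     'what-s-new-this-week-1.html?limit=12' );

$path = new DOMXPath( $dom );
// this query is based on the link you provided, not your regex
$nodes = $path->evaluate( '//ul[@class="products-grid first odd"]/li' );
foreach( $nodes as $node )
{
    // the first <a> inside the <li> is the anchor tag you're looking for initially.
    $anchor = $path->query( './/a', $node )->item( 0 );
    if( $anchor )
    {
        echo $anchor->getAttribute( "href" ), "\n";
    }
    // query the other descendants of $node the same way
}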
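
As a rough sketch of how the five fields from the question's regex (link, image, brand, title, price) might be pulled out with DOMXPath: the HTML below is a shortened, made-up stand-in for one product from the question, the example.com URLs are placeholders, and wrapping each product in an `<li>` is an assumption carried over from the XPath query above. The class names are the ones visible in the question's sample.

```php
<?php
// Shortened stand-in for one product block (hypothetical URLs, assumed <li> wrapper).
$html = <<<HTML
<ul><li>
<a href="http://www.example.com/dordogne-120-sandals.html" title="DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
  <img class="image1st" src="http://www.example.com/img/sandal.jpg" width="230" height="260" />
</a>
<h2 class="designer-name">Christian Louboutin</h2>
<h3 class="product-name"><a href="http://www.example.com/dordogne-120-sandals.html" title="DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<div class="price-box"><span class="regular-price"><span class="price">\$805.00</span></span></div>
</li></ul>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);          // @ hides libxml warnings about imperfect markup
$xpath = new DOMXPath($dom);

$products = array();
foreach ($xpath->query('//li') as $li) {
    // string(...) returns the text value of the first matching node, or '' if none.
    $products[] = array(
        'link'  => $xpath->evaluate('string(.//a[@class="product-image"]/@href)', $li),
        'image' => $xpath->evaluate('string(.//img/@src)', $li),
        'brand' => trim($xpath->evaluate('string(.//*[@class="designer-name"])', $li)),
        'title' => $xpath->evaluate('string(.//*[@class="product-name"]/a/@title)', $li),
        // Strip surrounding whitespace, then the leading dollar sign.
        'price' => ltrim(trim($xpath->evaluate('string(.//span[@class="price"])', $li)), '$'),
    );
}

print_r($products);
```

Each field gets its own small query, so a change in one part of the markup only breaks one line instead of the whole pattern.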
cwallenpoole
  • We need a new "Inconceivable" badge! – Phil Aug 10 '11 at 03:55
  • Come on, it's *certainly conceivable* and sometimes the only chance if you have huge legacy frontpage cruft to put up with. – ZJR Aug 10 '11 at 04:37
  • @ZJR You missed your opportunity to say, "That word, I do not think it means what you think it means." – cwallenpoole Aug 10 '11 at 04:40
  • thanks for the suggestion. Ideally I'll get this working as it is, and then go back and do it properly using DomDocument(), since it looks like it might take a bit of reading to get proficient with it, and I need at least a clunky version soon. Also I think understanding what's going wrong will at least teach me something about regex. I tried running your code as a script and got no result, I think maybe loadHTML is timing out. I'll fiddle with it in the morning, thanks for taking the time to write it up. – jela Aug 10 '11 at 04:48

Those negative lookaheads are clever, but then... slightly too clever.

And I concur: you used too many of them not to get side effects.

Can't see which one is running wild right now, but putting a repeated . like that... is always bound to give you greediness problems.

This one, for example, is certainly unnecessary:

title="
(?<title>((?!">).)+)

as you could have written it

title="(?<title>.*?)">

...there are more like it. I'd change them all.

In general, regex debugging means rephrasing the pattern again and again, using different constructs, until you find the right balance between functionality and maintainability.

Another thing: I would use <a\s+ instead of <a\s, just slightly more flexible.
Stay slightly flexible, it pays.

Also: title= could present itself as title\s*=\s*
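
Putting those suggestions together, a sketch of the reworked pattern might look like the following. Every `(?:(?!…).)+` is swapped for a lazy `.*?`, and the looser `\s+` and `\s*=\s*` spacing is applied. The two-product HTML and the example.com URLs are invented for illustration. Keep in mind that a lazy `.*?` is not strictly equivalent to the lookahead form: if a later part of the pattern fails, it can stretch past the first occurrence, so this is a starting point rather than a drop-in replacement.

```php
<?php
// Hypothetical two-product sample, shaped like the question's HTML but shortened.
// The first title contains unescaped double quotes on purpose.
$output = <<<HTML
<a href="http://www.example.com/one.html" title="ONE" class="product-image">
  <img class="image1st" src= "http://www.example.com/one.jpg" width="230" /></a>
<h2 class="designer-name">Brand One</h2>
<h3 class="product-name"><a href="http://www.example.com/one.html" title=
"PRODUCT "ONE" SANDALS">PRODUCT ONE</a></h3>
<span class="price">\$805.00</span>
<a href="http://www.example.com/two.html" title="TWO" class="product-image">
  <img class="image1st" src="http://www.example.com/two.jpg" width="230" /></a>
<h2 class="designer-name">Brand Two</h2>
<h3 class="product-name"><a href="http://www.example.com/two.html" title="PRODUCT TWO">PRODUCT TWO</a></h3>
<span class="price">\$1,250.00</span>
HTML;

// Lazy quantifiers instead of per-character negative lookaheads.
// The title still ends at the two-character sequence "> so stray quotes inside it survive.
$pattern = '~
    <a\s+href\s*=\s*"(?<link>[^"]+)"
    .*?src\s*=\s*"(?<image>[^"]+)"
    .*?designer-name">(?<brand>[^<]+)
    .*?title\s*=\s*"(?<title>.*?)">
    .*?"price">\$(?<price>[\d.,]+)
~xsi';

$total_matches = preg_match_all($pattern, $output, $all_matches, PREG_SET_ORDER);

foreach ($all_matches as $m) {
    echo $m['brand'], ' | ', $m['title'], ' | ', $m['price'], "\n";
}
```

The `~` delimiter is used only so the pattern reads cleanly on one screen; the `{ }` delimiters from the question would work just as well here.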

ZJR
  • that's an interesting case with the title, because you're technically correct that the lookahead is superfluous. The problem is that sometimes whoever writes the html fails properly to encode double quotation marks in the title, which means that I can't trust a double quotation mark by itself to mean the end of the title. In any event I'll start replacing negative lookaheads with lazy stars and see what happens. You're right about adding the spaces for sure. – jela Aug 10 '11 at 04:59