I have the following regex in a PHP script:
$total_matches = preg_match_all('{
<a\shref="
(?<link>[^"]+)
"(?:(?!src=).)+src="
(?<image>[^"]+)
(?:(?!designer-name">).)+designer-name">
(?<brand>[^<]+)
(?:(?!title=).)+title="
(?<title>((?!">).)+)
(?:(?!"price">).)+"price">\$
(?<price>[\d.,]+)
}xsi',$output,$all_matches,PREG_SET_ORDER);
This regex seems to work fine when parsing the following markup, both via PHP and in the tester at regexr.com (with the same options set: case-insensitive, extended, and dot matches line breaks):
<a href="http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 PLATEAU SANDALEN" class="product-image">
<img class="image1st" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/small_ image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-STANDARD.jpg"
width="230" height="260"
alt= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH"
title= "Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" />
<img class="image2nd" src= "http://mytheresaimages.s3.amazonaws.com/catalog/product/cache/common/product_114114/image/230x260/9df78eab33525d08d6e5fb8d27136e95/P/0/P00027794-DORDOGNE-120-PLATEAU-SANDALEN-DETAIL_2.jpg"
width="230" height="260" alt=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" title=
"Christian Louboutin - DORDOGNE 120 SANDALS - mytheresa.com GmbH" /> <span class=
"availability"><strong>available sizes</strong><br /></span></a>
<div style="margin-left: 2em" class="available-sizes">
<h2 class="designer-name">Christian Louboutin</h2>
<div class="product-buttons">
<div class="product-button">
NEW ARRIVAL
</div>
<div class="clearer"></div>
</div>
<h3 class="product-name"><a href=
"http://www.mytheresa.com/us_en/dordogne-120-sandals.html" title=
"DORDOGNE 120 SANDALS">DORDOGNE 120 SANDALS</a></h3>
<div class="price-box">
<span class="regular-price" id="product-price-114114"><span class=
"price">$805.00</span></span>
</div>
If I try to parse several matches in a row, it also works fine. However, when I try to parse the full page these matches come from (I have permission to parse it):
http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12
the regex fails (I actually get a 500 error). I've tried raising the PCRE backtrack and recursion limits with:
ini_set('pcre.backtrack_limit',100000000);
ini_set('pcre.recursion_limit',100000000);
but this does not solve the problem. What am I doing wrong that makes the regex fail in PHP, when the pattern appears to be valid and matches the markup on the relevant page? Experimenting suggests that the negative lookaheads, combined with the length of the page, are causing the problem, but I can't see what is wrong with them. I am running PHP 5.2.17.
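For reference, here is a minimal sketch of how the result of the call could be checked with preg_last_error(), which reports whether PCRE stopped because of the backtrack or recursion limit. The helper name describe_pcre_error, the shortened pattern, and the sample markup are illustrative assumptions and not part of my actual script. This check only helps if the PHP process survives the call; an outright crash would not be reported this way.

```php
<?php
// Hypothetical helper (not part of the original script):
// maps a preg_last_error() code to a readable constant name.
function describe_pcre_error($code)
{
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown PCRE error ' . $code;
}

// Illustrative stand-in for the page source; the real script uses $output.
$sample = '<a href="http://example.com/item.html"><span class="price">$805.00</span></a>';

// Shortened pattern in the same style as the full regex (assumption: only
// the link and price groups are kept, to make the sketch easy to follow).
$pattern = '{<a\shref="(?<link>[^"]+)"(?:(?!"price">).)+"price">\$(?<price>[\d.,]+)}xsi';

$count = preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER);

// Inspect the PCRE error state instead of relying on the return value alone.
$error = preg_last_error();
if ($error !== PREG_NO_ERROR) {
    echo 'PCRE error: ' . describe_pcre_error($error) . "\n";
} else {
    echo 'Matches found: ' . $count . "\n";
}
```

Running the same check against the full page, with the full pattern, should show whether one of the limits is being hit or whether the failure happens before PHP gets a chance to report anything.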