0

I'm having trouble with grep.. Which four patterns should I use with PHP's preg_grep to extract all instances the "__________" stuff in the strings below?

1. <h2><a ....>_____</a></h2>
2. <cite><a href="_____" .... >...</a></cite>
3. <cite><a .... >________</a></cite>
4. <span>_________</span>

The dots denote some arbitrary characters while the underscores denote what I want.

An example string is:

     </style></head>
<body><div id="adBlock"><h2><a href="https://www.google.com/adsense/support/bin/request.py?contact=afs_violation&amp;hl=en" target="_blank">Ads by Google</a></h2>
<div class="ad"><div><a href="http://www.google.com/aclk?sa=L&amp;ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&amp;num=1&amp;sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&amp;q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="titleLink" target="_parent">Spider-<b>Man</b> Animated Serie</a></div>
<span>See Your Favorite Spiderman
<br>
Episodes for Free. Only on Crackle.</span>
<cite><a href="http://www.google.com/aclk?sa=L&amp;ai=C4vfT4Sa3S97SLYO8NN6F-ckB5oq5sAGg6PKlDaT-kwUQASCF4p8UKARQtobS9AVgyZbRhsijoBnIAQGqBBxP0OSEnIsuRIv3ZERDm8GiSKZSnjrVf1kVq-_Y&amp;num=1&amp;sig=AGiWqtwG1qHnwpZ_5BNrjrzzXO5Or6EDMg&amp;q=http://www.crackle.com/c/Spider-Man_The_New_Animated_Series/%3Futm_source%3Dgoogle%26utm_medium%3Dcpc%26utm_campaign%3DGST_10016_CRKL_US_PRD_S_TeleV_SPID_Tele_Spider-Man%26utm_term%3Dspiderman%26utm_content%3Ds264Yjg9f_3472685742_487lrz1638" class="domainLink" target="_parent">www.Crackle.com/Spiderman</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&amp;num=2&amp;sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&amp;q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="titleLink" target="_parent">Kids <b>Batman</b> Costumes</a></div>

<span>Great Selection of <b>Batman</b> &amp; Batgirl
<br>
Costumes For Kids. Ships Same Day!</span>
<cite><a href="http://www.google.com/aclk?sa=l&amp;ai=CnQFi4Sa3S97SLYO8NN6F-ckB3M7nQtyU2PQEq6bCBRACIIXinxQoBFCm15KB-f____8BYMmW0YbIo6AZoAHiq_X-A8gBAaoEIU_Q9JKLiy1MiwdnHpZoBnmpR1J8pP2jpTwMx2uj2nN4WA&amp;num=2&amp;sig=AGiWqtwDrI5pWBCncdDc80FKt32AJMAQ6A&amp;q=http://www.costumeexpress.com/browse/TV-Movies/_/N-1z141uu/Ntt-batman/results1.aspx%3FREF%3DKNC-CEgoogle" class="domainLink" target="_parent">www.CostumeExpress.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&amp;num=3&amp;sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&amp;q=http://www.OfficialBatmanCostumes.com" class="titleLink" target="_parent"><b>Batman</b> Costume</a></div>
<span>Official <b>Batman</b> Costumes.

<br>
Huge Selection &amp; Same Day Shipping!</span>
<cite><a href="http://www.google.com/aclk?sa=l&amp;ai=CAMYT4Sa3S97SLYO8NN6F-ckB3ZnWmgGdoNLrDaumwgUQAyCF4p8UKARQrqSVxwdgyZbRhsijoBmgAZH77uwDyAEBqgQYT9DU7oqLLEyLB2dHlxZFnQzyeg-yHt88&amp;num=3&amp;sig=AGiWqtzqAphZ9DLDiEFBJlb0Ou_1HyEyyA&amp;q=http://www.OfficialBatmanCostumes.com" class="domainLink" target="_parent">www.OfficialBatmanCostumes.com</a></cite></div> <div class="ad"><div><a href="http://www.google.com/aclk?sa=l&amp;ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&amp;num=4&amp;sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&amp;q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="titleLink" target="_parent">Discount <b>Batman</b> Costumes</a></div>
<span>Discount adult and kids <b>batman</b>
<br>
superhero costumes.</span>

<cite><a href="http://www.google.com/aclk?sa=l&amp;ai=C767t4Sa3S97SLYO8NN6F-ckBkZfSfoOppaMHq6bCBRAEIIXinxQoBFDX2bw6YMmW0YbIo6AZoAHpprP8A8gBAaoEG0_QhJSMiytMiwdnHpZoF3g0Uj8_Vl2r4TpI_g&amp;num=4&amp;sig=AGiWqtyGO2DnFq_jMhP6ufj8pufT9sWQWA&amp;q=http://www.discountsuperherocostumes.com/batman-costumes.html" class="domainLink" target="_parent">www.discountsuperherocostumes.com</a></cite></div></div></body>
<script type="text/javascript">
      var relay = "";
    </script>
<script type="text/javascript" src="/uds/?file=ads&amp;v=1&amp;packages=searchiframe&amp;nodependencyload=true"></script></html>

Thanks!

rawrrrrrrrr
  • 3,647
  • 7
  • 28
  • 33
  • 3
    Don't ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Joey Apr 03 '10 at 11:50

1 Answers1

3

First of all, you should not use regex to extract data from an HTML string.
Instead, you should use a DOM Parser !

Here, you could use :


For example, you could load your document, and instanciate the DOMXpath class this way :

$html = <<<HTML
....
....
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

And, then, use XPath to find the elements you are looking for.


For example, in the first case, you could use something like this, to find all <a> tags that are children of <h2> tags :

// <h2><a ....>_____</a></h2>
$tags = $xpath->query('//h2/a');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}
echo '<hr />';


Then, for the second and third case, you are searching for <a> tags that are children of <cite> tags -- and when you've found them, you want to check if they have a href attribute or not :

// <cite><a href="_____" .... >...</a></cite>
// <cite><a .... >________</a></cite>
$tags = $xpath->query('//cite/a');
foreach ($tags as $tag) {
    if ($tag->hasAttribute('href')) {
        var_dump($tag->getAttribute('href'));
    } else {
        var_dump($tag->nodeValue);
    }
}
echo '<hr />';


And, finally, for the last one, you just want <span> tags :

// <span>_________</span>
$tags = $xpath->query('//span');
foreach ($tags as $tag) {
    var_dump($tag->nodeValue);
}


Not that hard -- and much easier to read that regexes, isn't it ? ;-)

Pascal MARTIN
  • 395,085
  • 80
  • 655
  • 663
  • Do you know how to debug the if $tag hasAttribute block? I am absolutely sure that the tag I'm searching has the attributes I'm searching for, but it's not recognizing them, so it's dumping everything between the parent tag. IE... I'm searching the body tag for onload, and it's ignoring the onload attribute and printing everything between open body and close body. – EllaJo May 13 '11 at 14:43