I have this HTML:

<html>
<div class="the_grp">
<h3>heading <span id="sn-sin" class="the_decs">(keyword: <i>cat</i>)</span></h3>
<ul>
    <li>
        <div>
            <div><span class="w_pos"></span></div>
            <div class="w_the">
            <a href="http://www.exampledomain.com/20111/cute-cat">cute cat</a>, 
            <a href="http://www.exampledomain.com/7456/catty">catty</a>, 
            </div>
        </div>
    </li>   
    <li>
        <div>
            <div><span class="w_pos"></span></div>
            <div class="w_the">
            <a href="http://www.exampledomain.com/7589/sweet">sweet</a>, 
            <a href="http://www.exampledomain.com/10852/sweet-cat">sweet cat</a>, 
            <a href="http://www.exampledomain.com/20114/cat-vs-dog">cat vs dog</a>, 
        </div>
    </li>
</ul>
</div>

<a id="ant"></a>
<div class="the_grp">
<h3>another heading <span id="sn-an" class="the_decs">(ignore this: <i>cat</i>)</span></h3>
<ul>
    <li>
        <div>
            <div><span class="w_pos"></span></div>
            <div class="w_the"><a href="http://www.exampledomain.com/118/bad-cat">bad cat</a></div>
        </div>
    </li>
</ul>
</div>

I want to match the following words from the HTML:

  • cute cat
  • catty
  • sweet
  • sweet cat
  • cat vs dog

I'm using this pattern and reading capture group 2 (`$matches[2]`) to get those words:

#<a href="http\:(.*?)">(.*?)<\/a>#i

My PHP code looks like this:

preg_match_all('#<a href="http\:(.*?)">(.*?)<\/a>#i', $data, $matches);
echo '<pre>';
print_r($matches[2]);
echo '</pre>';

That pattern matches "bad cat" too. How can I capture only the following words: cute cat, catty, sweet, sweet cat, cat vs dog?

Thanks in advance.

danul
  • I'll refer to [this post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ChrisG Mar 15 '17 at 19:09
  • Don't use regex for parsing HTML. – vallentin Mar 15 '17 at 19:10
  • The pattern that you're using will match everything inside `a`. The thing that you're trying to do is scraping, just look for a PHP library for this. – MikeVelazco Mar 15 '17 at 19:12
  • @MikeVelazco I used Simple HTML DOM before, but I still couldn't find a solution because those words sit in divs with the same class. – danul Mar 15 '17 at 19:19
  • I'm not a regex expert, but you can replace the second `(.*?)` with `(cute cat|catty|sweet|sweet cat|cat vs dog)` – MikeVelazco Mar 15 '17 at 19:22

1 Answer

It would be best to just use an HTML parser. Here's how to do it with Simple HTML DOM (http://simplehtmldom.sourceforge.net/).

file_get_html is preferable when you are loading from a URL or a file; it basically calls file_get_contents and then str_get_html.

str_get_html parses a string into a Simple HTML DOM object.

<?php

require('simple_html_dom.php');

// $data is the HTML string from the question
$html = str_get_html($data);

// Print the text of every link in the document
foreach ($html->find('a') as $element) {
    echo $element->plaintext . '<br>';
}

?>

If you don't want "bad cat" to match, simply skip it while looping through the results:

foreach ($html->find('a') as $element) {
    // Skip the one link we don't want
    if ($element->plaintext !== "bad cat") {
        echo $element->plaintext . '<br>';
    }
}
Neil