I'm currently scraping data from an HTML page, and one piece of my code is not working. The HTML contains something like this:

<ul class="pagination">
    <li>
        <span class="page active">
            1
        </span>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars1">
            2
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars2">
            3
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars3">
        4
        </a>
    </li>
</ul>

I tried the code below to get the href of the link that follows the active page. In this example the active page is page 1, so I should get the href for page 2, which is /somethingherewithanychars1. It is not working:

$file_string = file_get_contents($url); 
 preg_match('/<li><span class="page active">.*?<\/span><\/li><li><a class="page available" href="(.*)">/i', $file_string, $pages); 

print_r($pages);

The HTML I'm accessing also has markup like this:

<div class="attributes">
   <a class="name" href="/linksTothissite" data-hovercard-id="somechars">link1</a>
   <span class="list">
    USA
   </span>
   <a class="name" href="/linksTothissite" data-hovercard-id="somechars">link2</a>
   <span class="list">
    CANADA
   </span>
</div>

I tried getting the values with this code, and I do get link1 and link2:

preg_match_all('/<a class="name" href=".*?" data-hovercard-id=".*?">(.*)<\/a>/i', $file_string, $values); 

With this one I also get USA and CANADA:

 preg_match_all('/<span class="list">(.*?)<\/span>/s',$file_string, $values); 
         $val= $values[1]; 

Why is my preg_match not returning the value I need? I also tried preg_match_all(), but print_r still outputs an empty Array ( ), even though the rest of my code works.
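
One likely cause, assuming the fetched page is laid out like the pagination sample above: the pattern expects `<li><span ...>` and `</span></li><li><a ...>` with nothing between the tags, while the sample has newlines and indentation there. The following is only a sketch of a whitespace-tolerant pattern; `$html` is a hypothetical stand-in for `$file_string`, trimmed to two list items.

```php
<?php
// Hypothetical stand-in for $file_string; mirrors the pagination sample.
$html = <<<HTML
<ul class="pagination">
    <li>
        <span class="page active">
            1
        </span>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars1">
            2
        </a>
    </li>
</ul>
HTML;

// \s* allows newlines and indentation between tags.
// The s flag lets .*? cross line breaks inside the <span>.
// [^"]* stops the capture at the closing quote of href instead of running on greedily.
$pattern = '/<li>\s*<span class="page active">.*?<\/span>\s*<\/li>\s*'
         . '<li>\s*<a class="page available" href="([^"]*)">/is';

// preg_match returns 1 on a match; $pages[1] then holds the captured href.
$next = preg_match($pattern, $html, $pages) === 1 ? $pages[1] : null;
var_dump($next);
```

This still depends on the exact attribute order and class strings, so it may break if the real page differs from the sample.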

Andy Lester
Snippet
  • Maybe this helps: http://stackoverflow.com/a/1732454 – Cobra_Fast Sep 01 '13 at 16:23
  • Regex is only suitable to such tasks if you know what you're doing. Seeing that your preg_match doesn't even account for the whitespace between the tags – mario Sep 01 '13 at 16:24
  • I see. I tried changing my code to ignore whitespace and newlines, but it is still not working. The other code also has newlines, yet I can get its values. – Snippet Sep 01 '13 at 16:36
  • @Cobra_Fast Please don't post links to that question, because they are not helpful to the reader, unless you follow it up with something that is an answer they can use. *You* know the point of the comment and that wall of text is that parsing HTML with regexes is a bad idea. However, to someone else who is asking, that is not at all clear. Worse, it doesn't point the reader to any useful solutions that *can* help parse HTML reliably. – Andy Lester Sep 01 '13 at 21:02
  • @AndyLester The post I've linked ends with "Have you tried using an XML parser instead?"... – Cobra_Fast Sep 01 '13 at 22:05
  • @Cobra_Fast: I suspect that that gets lost amidst the mass of unreadable text. If you're going to suggest an XML parser, why not simply make the comment "Have you tried using an XML parser instead?" which is far more likely to actually be read by the OP. – Andy Lester Sep 02 '13 at 15:48
  • @AndyLester Because posting that answer is way more entertaining. I'd prefer to not discuss the controversy of the answer I've linked here since it already spawned extensive meta threads with good arguments for all different viewpoints... – Cobra_Fast Sep 02 '13 at 15:54
  • @Cobra_Fast: I prefer to go with helpful over entertaining to the commenters. I'm sure that those we are here to help do, too. – Andy Lester Sep 03 '13 at 03:31
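
The parser route suggested in the comments could look roughly like the sketch below, which uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) rather than a regex. It assumes the page resembles the pagination sample in the question; `$html` and `$nextHref` are hypothetical names, and `$html` stands in for the fetched `$file_string`.

```php
<?php
// Hypothetical stand-in for $file_string; mirrors the pagination sample.
$html = <<<HTML
<ul class="pagination">
    <li>
        <span class="page active">
            1
        </span>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars1">
            2
        </a>
    </li>
    <li>
        <a class="page available" href="/somethingherewithanychars2">
            3
        </a>
    </li>
</ul>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is often not perfectly well-formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Find the <li> whose <span> carries the "active" class token,
// then take the href of the <a> in the next sibling <li>.
$query = '//li[span[contains(concat(" ", normalize-space(@class), " "), " active ")]]'
       . '/following-sibling::li[1]/a/@href';

$nodes = $xpath->query($query);
$nextHref = ($nodes !== false && $nodes->length > 0)
    ? $nodes->item(0)->nodeValue
    : null;
var_dump($nextHref);
```

Because the query works on the parsed tree, whitespace and line breaks between the tags do not affect it. On the last page there is no following `<li>`, and `$nextHref` is `null`.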