-3

I have always used preg_match to scrape URLs from HTML files, but now I want to extract only the URLs that end in .mp3. I was told to try DOM, and I have been trying to fix some code, but it doesn't work: I get a blank page whatever I do.

What am I doing wrong?

<?php
    $url = 'http://www.mp3olimp.net/miley-cyrus-when-i-look-at-you/';
    $html = @file_get_html($url);
    $dom = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc); 
    $links = $xpath->query('//a[ends-with(@href, ".mp3")]/@href');

    echo $links;
?>
Ry-
andrew
  • What happens with print_r($links) instead of echo? – Malcolm Diggs Jun 20 '13 at 23:25
  • @MalcolmDiggs the result is the same, a blank page – andrew Jun 20 '13 at 23:29
  • Well, the first thing I would do is remove the @ sign from @file_get_html. Prepending the @ just suppresses errors, but in this case you WANT to see errors, so you might as well remove it and let the script tell you what's going wrong. – Malcolm Diggs Jun 20 '13 at 23:37
  • You need to do basic troubleshooting. That means understanding how PHP reports errors and where you can obtain more information about them. See also: [How to get useful error messages in PHP?](http://stackoverflow.com/q/845021/367456) – hakre Jun 22 '13 at 00:11

2 Answers

4

There are a couple of problems!

  • As noted, remove the @ before file_get_html() so that the errors are displayed.
  • file_get_html() is not a built-in PHP function; it comes from the Simple HTML DOM library. Unless that library is included, use file_get_contents($url), which returns the HTML as a string.
  • Typo: $dom = should be $doc =.
  • Another annoying point: the HTML source is fairly malformed, which leads to parser warnings later on.
  • ends-with() is only supported in XPath 2.0, and PHP's DOMXPath uses XPath 1.0, so you'll have to find another way to check the ending. A bit of regex should do the trick.
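
Putting the points above together, here is a minimal sketch of how the corrected script might look. It is one possible approach, not the only fix: instead of regex it emulates ends-with() in XPath 1.0 with substring() and string-length(). The function name extractMp3Links and the sample markup are made up for illustration, and libxml_use_internal_errors() is used here as one way to keep warnings from the malformed HTML out of the output.

```php
<?php
// Show errors while debugging instead of ending up with a blank page.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper (name not from the original post):
// returns the href of every <a> whose href ends in ".mp3".
function extractMp3Links(string $html): array
{
    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // XPath 1.0 has no ends-with(), so compare the last four characters.
    $query = '//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href';

    $links = [];
    // query() returns a DOMNodeList, which cannot be echoed directly;
    // iterate over it and read each attribute node's value.
    foreach ($xpath->query($query) as $attr) {
        $links[] = $attr->nodeValue;
    }
    return $links;
}

// Made-up sample markup so the sketch runs without network access.
$html = '<html><body>
  <a href="http://example.com/song.mp3">Song</a>
  <a href="http://example.com/page.html">Page</a>
  <a href="/files/other.mp3">Other</a>
</body></html>';

print_r(extractMp3Links($html));
```

For the real page, the sample string would be replaced with the result of file_get_contents($url). Note that the comparison is case-sensitive, so a link ending in .MP3 would not be matched by this query.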
TimWolla
  • Be sure to use proper code formatting to make your answer easier to read. – TimWolla Jun 21 '13 at 00:31
  • Thanks for this! Just getting into this whole StackOverflow thing. Long time reader, first time poster (cliché, I know). – Matthew FitzGerald-Chamberlain Jun 21 '13 at 00:50
  • You're welcome. Make sure to read the [help](http://stackoverflow.com/help) and have a look at the options the editor offers you. That way it should be easy to write some good answers and gain reputation. – TimWolla Jun 21 '13 at 00:52
0
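$url = 'http://www.mp3olimp.net/miley-cyrus-when-i-look-at-you/'; // URL from the question
$input = file_get_contents($url);
// Match <a> tags whose href ends in a literal ".mp3" (note the escaped dot).
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?\\.mp3)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
  foreach ($matches as $match) {
    // $match[2] = link address
    // $match[3] = link text
    echo $match[2], "\n";
  }
}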
Dani-san