regexp works when extracting HTML, but not with file_get_contents

Question

This is my code

$file_string = file_get_contents('http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013');

preg_match_all('/<th.*>.*Organizer.*<a.*>(.*)<\/a>/msi', $file_string, $organizer);
if (empty($organizer[1])) {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
}
else {
    $stats['organizer'] = $organizer[1];
}

I'm trying to get the "Organizer" field from the right-hand "League Information" box on http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013 but it isn't working.

However (and this is what I don't understand), when I copy the HTML from the page and change the $file_string variable to this:

$file_string = '<tr>
<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>
</td></tr>';

The regexp works. Could it be JavaScript or something? However, I'm able to extract the data from pretty much all of the other rows in the same box using regexp. I must be missing something obvious here; maybe I just need a fresh set of eyes?

**Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php or [this SO thread](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. — Andy Lester, Jan 15 '14 at 15:50
Well... The most obvious thing to try is `var_dump($file_string)`. — Álvaro González, Jan 15 '14 at 15:52
Fix your regexp. `` is either valid with `` and `blabla
infinite more html
` — Peter, Jan 15 '14 at 16:06
@AndyLester Thanks, I will look into these for the future, I'm actually using Python and BeautifulSoup mostly these days, but this was an old script which I wanted to re-use, and I just need to add on this functionality, so I was hoping there was a swift solution to this. — Anders, Jan 15 '14 at 16:37
@ÁlvaroG.Vicario I've actually done this. It gives me exactly the HTML I'd expect — Anders, Jan 15 '14 at 16:37

score 2 · Accepted Answer · answered Jan 16 '14 at 02:20

This code should work:

$file_string = file_get_contents('http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013');

preg_match_all('/<th.{0,30}>.*Organizer.*?<\/a>/msi', $file_string, $organizer);
print_r($organizer);
if (empty($organizer[0])) {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
}
else {
    $stats['organizer'] = $organizer[0];
}

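Use $organizer[0] instead of $organizer[1]: this pattern has no capture group, so $organizer[1] is never set, and $organizer[0] holds the full matches (here, the first and only one). You also had to make .* lazy by putting a question mark after it. A lazy quantifier matches as few characters as possible, so it stops at the first place where the rest of the pattern can match.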

For example this code

<a.*>(.*)<\/a>

is greedy: with the s modifier it matches from the first <a tag to the last </a> on the page (it doesn't stop at the first </a>), while this code

<a.*?>(.*?)<\/a>

stops at the first </a> it finds.
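To make the difference concrete, here is a minimal sketch that runs both patterns against a small two-link string. The sample markup and the variable names `$html`, `$greedy` and `$lazy` are made up for illustration and are not taken from the page in the question.

```php
<?php
// Hypothetical sample: two links on separate lines, standing in for a full page.
$html = "<a href=\"/one\">First</a>\n<a href=\"/two\">Second</a>";

// Greedy: with the s modifier, .* also crosses newlines, so the overall match
// runs from the first <a to the LAST </a> in the string.
preg_match('/<a.*>(.*)<\/a>/s', $html, $greedy);

// Lazy: .*? expands only as far as needed, so the match ends at the FIRST </a>.
preg_match('/<a.*?>(.*?)<\/a>/s', $html, $lazy);

echo $greedy[1], "\n"; // Second
echo $lazy[1], "\n";   // First
```

In the greedy case `$greedy[0]` is the entire string, and the captured group ends up holding the text of the last link rather than the first one.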

View the page source after you echo it, since a browser will render the HTML rather than show it. This will be the result (I assume you wanted it like this, with the HTML included):

<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>

Thanks! I still needed to get the name of the Organizer out, which this didn't allow me to, but I did manage to figure that bit out from here. Thanks again for the help. :) — Anders, Jan 16 '14 at 12:02
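For the name alone, one possible approach is to add a capture group around the link text. The sketch below is not the code Anders ended up with (that was never posted); it runs against the HTML fragment quoted in the question, and the pattern and the `$m` variable are assumptions chosen for illustration.

```php
<?php
// The fragment quoted in the question, standing in for the downloaded page.
$file_string = '<tr>
<th valign="top"> Organizer:
</th>
<td style="width:55%;"> <a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>
</td></tr>';

// [^>]* keeps each tag match inside a single tag, the lazy .*? stops at the
// nearest following link, and the group captures only the link text.
$pattern = '/<th[^>]*>\s*Organizer.*?<a[^>]*>(.*?)<\/a>/si';

$stats = [];
if (preg_match($pattern, $file_string, $m)) {
    // $m[0] is the whole match, $m[1] is the first capture group.
    $stats['organizer'] = trim(strip_tags($m[1]));
} else {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
}

echo $stats['organizer'], "\n"; // NorthCon
```

One caveat with this sketch: if the Organizer row contained no link, the lazy `.*?` would run on to the first link in a later row, so the HTML parsers suggested in the comments remain the more robust option.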
