Extracting content from a website with Regular Expressions

Question

Hi, I'm just trying to get the hang of regular expressions. I have been trying to extract content from this website, but I reckon I have a problem with my regexp, as I cannot add anything to the array. Can anyone point me in the right direction? I reckon it's just something small.

Thanks

<?php   
    $f1 = fopen("http://www.irishexaminer.com/","r");
    $document = fread($f1,100000);
    fclose($f1);
    $regexp = "%<p>(.+)</p><p>%";
    preg_match($regexp,$document,$getHeading);  
    echo "<br>" . $getHeading[1];
    echo '<pre>';
    print_r($getHeading);
    echo '</pre>';
?>

Have you tried confirming that `$document` does actually contain the html? — JaredC, Jan 10 '13 at 15:07
Yes, I just had another look and it does contain `<p>` tags such as: THERE is no excuse for loyalist violence on the streets of Belfast. — user1344192, Jan 10 '13 at 15:21
Also: You probably shouldn't be parsing HTML with Regular Expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — FrankieTheKneeMan, Jan 10 '13 at 15:28
Much as it pains me, they're using very strange HTML syntax, and your regex is incorrect. Try `"%
(.+?)
%"` — FrankieTheKneeMan, Jan 10 '13 at 15:31
Some people who are wizards with regular expressions insist on using them to parse HTML/XML, and they still get it wrong. If you're not a wizard with regular expressions, don't even attempt it. It's the wrong tool for the job. Use a proper parser. — Michael Kay, Jan 10 '13 at 16:11

Robert Cutajar · Accepted Answer · 2013-01-10T17:24:59.543

1

THERE is no excuse for white space in the closing tag of p in your case.

<p> THERE is no excuse for loyalist violence on the streets of Belfast.</ p><p>

Regex to match

%<p>(.+)</\s*p><p>%
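
As a quick check, here is a minimal sketch of that pattern run against a hard-coded sample. The sample string is hypothetical and assumes the closing tag is written with a space, as `</ p>`. The lazy `.+?` is my own addition so that the match stops at the first closing tag rather than the last.

```php
<?php
// Hypothetical sample modelled on the paragraph from the page; note the
// whitespace inside each closing tag.
$document = '<p> THERE is no excuse for loyalist violence on the streets of Belfast.</ p><p>Second paragraph.</ p><p>';

// \s* tolerates optional whitespace after "</". The lazy .+? (an assumption,
// not part of the pattern above) stops at the first closing tag.
$regexp = '%<p>(.+?)</\s*p><p>%';

$getHeading = [];
if (preg_match($regexp, $document, $getHeading) === 1) {
    // $getHeading[1] holds the text captured between the tags.
    echo trim($getHeading[1]), PHP_EOL;
} else {
    echo 'No match', PHP_EOL;
}
```

With the greedy `(.+)` the capture would run on to the last `</ p><p>` in the document, which is rarely what you want when a page holds many paragraphs.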

It would take a while to make a regex resilient enough for HTML. Take Frankie's advice too: invest your effort in something less prone to failure. You can use PHP's HTML Tidy extension.
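
To illustrate the parser route, here is a rough sketch that uses PHP's bundled DOM extension (`DOMDocument`) instead of Tidy; that is a swapped-in choice, not the tool named above. The sample markup is hypothetical and well-formed, so treat this as a starting point rather than a guarantee of how the real page will parse.

```php
<?php
// Hypothetical sample input; in practice this would be the fetched page.
$html = '<html><body>'
      . '<p> THERE is no excuse for loyalist violence on the streets of Belfast.</p>'
      . '<p>Second paragraph.</p>'
      . '</body></html>';

// Collect parser warnings internally instead of printing them; real-world
// markup tends to trigger plenty.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the trimmed text of every <p> element.
$paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```

The parser tracks element boundaries for you, so there is no pattern to tune each time the markup shifts.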

edited Jan 10 '13 at 17:24

answered Jan 10 '13 at 16:33

Robert Cutajar

