Possible Duplicate:
How to parse and process HTML with PHP?

I'm trying to scrape a page with PHP using file_get_contents(). This page has some JSON wrapped in a bit of HTML. I'd like to strip out this HTML to be able to use json_decode() on the scraped string so I can deal with the JSON separately. Is there any clean way to do that? A quick search didn't really lead to anything. Thanks

sf89

1 Answer

Parsing or stripping HTML content is always tricky: the common regex-based solutions can break when the HTML markup is malformed, and they are painfully slow as well. I would suggest using this small HTML DOM parser class:

http://simplehtmldom.sourceforge.net/


Edit (added from the comments below):

Okay, this is a bad one, because the inline JavaScript is not properly wrapped in CDATA tags. Otherwise something like this might work:

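// Assumes the Simple HTML DOM file sits next to this script; adjust the path.
require_once 'simple_html_dom.php';

$html = new simple_html_dom();
$html->load_file('your-external-file');

// Print every inline script that mentions the marker variable.
foreach ($html->find("script") as $obj) {
    // strpos() returns 0 for a match at the start, so compare against false.
    if (isset($obj->innertext) && strpos($obj->innertext, 'window._jscalls') !== false) {
        echo $obj->innertext;
    }
}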
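
To get from the script text to something json_decode() accepts, the JSON literal still has to be cut out of the surrounding JavaScript. Here is a minimal sketch of one way to do that. It uses PHP's built-in DOMDocument instead of Simple HTML DOM so that it runs without extra files. The helper name extractJsonFromScript and the sample markup are made up for illustration, and it assumes the JSON is the first {...} or [...] literal after the marker:

```php
<?php
/**
 * Hypothetical helper (not part of any library): find the inline <script>
 * containing $marker and decode the first JSON literal that follows it.
 * Returns the decoded value, or null if nothing could be decoded.
 */
function extractJsonFromScript(string $html, string $marker)
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('script') as $script) {
        $code = $script->textContent;
        $pos = strpos($code, $marker);
        if ($pos === false) {
            continue; // marker not in this script
        }
        // Skip past the marker, then locate the first '{' or '['.
        $rest = substr($code, $pos + strlen($marker));
        if (!preg_match('/[{\[]/', $rest, $m, PREG_OFFSET_CAPTURE)) {
            continue;
        }
        $start  = $m[0][1];
        $closer = $m[0][0] === '{' ? '}' : ']';
        // Try the longest candidate first, then shrink to earlier closers.
        $end = strrpos($rest, $closer);
        while ($end !== false && $end > $start) {
            $data = json_decode(substr($rest, $start, $end - $start + 1), true);
            if (json_last_error() === JSON_ERROR_NONE) {
                return $data;
            }
            $end = strrpos(substr($rest, 0, $end), $closer);
        }
    }
    return null;
}

// Made-up sample page; in practice $html would come from file_get_contents().
$html = '<html><head><script>var x = 1;</script></head><body><p>Hi</p>'
      . '<script>window._jscalls = [{"user": "kevin", "count": 2}];</script>'
      . '</body></html>';

$data = extractJsonFromScript($html, 'window._jscalls');
```

This is still quite dirty: it guesses where the JSON ends by trial decoding, so it only holds up as long as the page keeps embedding the data in roughly the same way.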
simplyray
  • Yeah using a regex is out of the question. I've thought of Simple HTML DOM, but since it's JSON I'm trying to parse, I can't really go with that as the returned string would only contain the HTML and not the JSON... – sf89 Nov 16 '12 at 08:57
  • Could you provide an example of the HTML/JSON markup? – simplyray Nov 16 '12 at 08:59
  • I'm on my phone right now, but the client made something pretty similar to what you can find on Instagram pages (like this one: http://instagram.com/kevin). Thx – sf89 Nov 16 '12 at 09:02
  • Okay this is a bad one because the inline javascript is not properly wrapped with CDATA-Tags. Otherwise something like this might work: see top post (quite dirty though). – simplyray Nov 16 '12 at 10:10
  • Great way of using Simple HTML DOM, buddy, thanks a lot. I can get it to work this way; even though it's not very clean, it does the trick, at least temporarily. – sf89 Nov 17 '12 at 01:08