Get contents page using regular expression

Question

<?php
$source='http://www.google.com/finance';
//$source='sample.txt';
$page_all = file_get_contents($source);
$div_array=array();
preg_match_all('#<div id="markets">(.*?)</div>#sim', $page_all, $div_array);
//print_r($div_array);
print_r($div_array[1]);
?>

I have this piece of code. I am trying to return the contents of a specific div from Google Finance.

All I end up with on screen, though, is array()

Any ideas?

Regards

What's the output of print_r? Also this guy has some reasonable answer to what you're trying to do: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — halfdan, Apr 21 '11 at 08:32
make sure you are allowed to read files from urls (a security risk in badly implemented systems) — knittl, Apr 21 '11 at 08:33
@halfdan, I think that's what is printing an empty array. Are you sure that $page_all has what you expect? — jmathai, Apr 21 '11 at 08:37
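To follow up on the two comments above, here is a minimal sketch of how one might check that the fetch actually worked before running any regex on it. The helper name `fetch_page` is made up for illustration, and the demo reads a temporary local file (much like the commented-out `sample.txt` line in the question) so that it runs without network access. The real URL would be passed in the same way.

```php
<?php
// Hypothetical helper (not from the question): read a URL or file and
// fail loudly instead of silently handing back nothing.
function fetch_page($source)
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled.
    $isUrl = (bool) preg_match('#^https?://#i', $source);
    $urlsAllowed = filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
    if ($isUrl && !$urlsAllowed) {
        throw new RuntimeException('allow_url_fopen is disabled; cannot read ' . $source);
    }

    // @ suppresses the warning; the return value is checked explicitly instead.
    $page = @file_get_contents($source);
    if ($page === false || $page === '') {
        throw new RuntimeException('Could not read any content from ' . $source);
    }
    return $page;
}

// Demo on a temporary local file so the sketch runs offline.
$sample = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($sample, '<div id=markets>Dow Jones</div>');

$page_all = fetch_page($sample);
echo strlen($page_all), " bytes read\n";
// Is the text we plan to match on actually in the page?
var_dump(strpos($page_all, 'id=markets') !== false);

unlink($sample);
```

If the page loads but the marker text is not in it, the problem is the pattern rather than the download.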

Daniel Sloof · Answer 1 · 2011-04-21T09:14:16.587

1

Don't use regex for this kind of thing; try a DOM parser such as SimpleHTMLDom.

<?php 
require_once('simple_html_dom.php');
echo file_get_html('http://www.google.com/finance')->find('#markets', 0);
?>

Yeah... it's that easy :)
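If installing a third-party library is not an option, the same idea can be sketched with PHP's bundled DOM extension. Note that this swaps in `DOMDocument` and `DOMXPath` in place of SimpleHTMLDom, and that the markup below is a made-up stand-in for the real page, not Google's actual HTML.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for the demo).
$html = '<html><body>'
      . '<div id=markets><div class="quotes">Dow Jones 12,453.54</div></div>'
      . '<div id="news">Other</div>'
      . '</body></html>';

// Collect libxml's complaints about sloppy markup instead of printing warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the div by its id, whether or not the attribute was quoted in the source.
$xpath = new DOMXPath($doc);
$node = $xpath->query('//div[@id="markets"]')->item(0);

if ($node !== null) {
    echo $doc->saveHTML($node), "\n";      // the div with its nested markup
    echo trim($node->textContent), "\n";   // text only, tags dropped
} else {
    echo "No element with id markets found\n";
}
```

Because the parser builds a tree, nested divs inside `#markets` come back intact, which is exactly where a non-greedy regex tends to cut off early.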

edit:

In response to your comment, behold the awesomeness of SimpleHTMLDom:

<?php 
require_once('simple_html_dom.php');

$html = file_get_contents('http://www.google.com/finance');
$tidy = tidy_parse_string($html);
$tidy->cleanRepair();
$html = str_get_html((string)$tidy);

foreach($html->find('#markets .quotes', 0)->find('tr') as $line) {
    printf("%s - %s - %s %s<br />", 
        $line->find('.symbol a', 0)->innertext,
        $line->find('.price span', 0)->innertext,
        $line->find('.change span', 0)->innertext,
        $line->find('.change span', 1)->innertext);
}
?>

Yeah, I had to use Tidy for that page... I don't know who Google hired to write that HTML, but it's absolutely horrendous: unclosed tds, multiple elements with the same id, etc. The parser choked on those :(

edited Apr 21 '11 at 09:14

answered Apr 21 '11 at 08:45

That is bloody awesome. Is there a way that I can incorporate strip tags to remove the links? – Minikoopa Apr 21 '11 at 09:00
I get Fatal error: Call to undefined function tidy_parse_string() on line 5 – Minikoopa Apr 21 '11 at 09:21
@Minikoopa: that means Tidy is not enabled on your installation... You have a couple of options: 1) enable the tidy module in your PHP installation 2) find a DOM parser that can handle Google's ridiculously malformed HTML 3) resort to regular expressions to parse whatever SimpleHTMLDom can't 4) get Google to change their website (good luck :D) – Daniel Sloof Apr 21 '11 at 09:24
No. 1 sounds like the easiest option. Thanks for your help, I really appreciate it. Will let you know how I get on. – Minikoopa Apr 21 '11 at 09:32
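For the two points raised in these comments, a rough sketch: guard the Tidy call so the script does not die with a fatal error when the extension is missing, and use `strip_tags()` to drop the link markup from a cell. The helper name `repair_html` and the sample cell are invented for illustration. SimpleHTMLDom elements also expose a `plaintext` property that should serve the same purpose as `strip_tags()` here.

```php
<?php
// Hypothetical helper: clean markup with Tidy when the extension is loaded,
// otherwise hand the markup back unchanged.
function repair_html($html)
{
    if (!function_exists('tidy_parse_string')) {
        return $html; // Tidy not enabled; the caller gets the raw markup
    }
    $tidy = tidy_parse_string($html);
    if ($tidy === false) {
        return $html; // parsing failed; fall back to the raw markup
    }
    $tidy->cleanRepair();
    return (string) $tidy;
}

// Made-up cell content of the kind innertext might return.
$cell = '<a href="/finance?q=INDEXDJX:.DJI">Dow Jones</a>';

echo function_exists('tidy_parse_string') ? "Tidy is available\n" : "Tidy is not enabled\n";
echo strip_tags($cell), "\n";               // link markup removed, text kept
echo strip_tags(repair_html($cell)), "\n";  // same text after the optional repair step
```

Without Tidy the raw markup goes straight to the parser, so on a page as broken as the one described above the results may still be incomplete.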

score 0 · Answer 2 · answered Apr 21 '11 at 08:50

I could not find <div id="markets"> in the HTML of http://www.google.com/finance, but I did find <div id=markets> (without quotes), so try:

<?php
$source='http://www.google.com/finance';
//$source='sample.txt';
$page_all = file_get_contents($source);
$div_array=array();
preg_match_all('#<div id=markets>(.*?)</div>#sim', $page_all, $div_array);
//print_r($div_array);
print_r($div_array[1]);
?>
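If the regex route is kept, one possible variation is a pattern that accepts the id with or without quotes, so it keeps working if the page changes its quoting. This is only a sketch against made-up sample markup, not the live page.

```php
<?php
// Made-up sample: one div with an unquoted id, one with a quoted id.
$page_all = '<div id=markets><span>Dow Jones</span></div>'
          . '<div id="other">x</div>';

// ["\']?       makes the quote around the id optional
// (?:\s[^>]*)? allows further attributes after the id
// s flag lets . match newlines; i flag ignores case
$pattern = '#<div\s+id=["\']?markets["\']?(?:\s[^>]*)?>(.*?)</div>#si';

if (preg_match($pattern, $page_all, $match)) {
    echo $match[1], "\n";   // contents of the first matching div
} else {
    echo "no match\n";
}
```

Keep in mind that the non-greedy `(.*?)</div>` stops at the first closing `</div>`, so if the markets div contains nested divs, only the part up to the first inner `</div>` is captured.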
