
I am relatively new to the whole idea of HTML parsing/scraping, and I was hoping I could get the help I need here!

Basically, what I am looking to do (I think) is specify the URL of the page I wish to grab the data from. In this case: http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/

From there, I want to grab the table class=listing in the div id=snapshot_table.

I then wish to embed that table onto my own page and have it update when the original content is updated.

I have read a few of the other posts on Google and Stack Overflow, and I also had a look at a tutorial on Nettuts+, but it all seemed like a bit too much to take in at once.

Hopefully someone here can help me out and make this as simple as possible :)

Cheers,

Mat

--Edit--

Current code as of 11:22am (GMT+10)

<?php
    # don't forget the library
    include('simple_html_dom.php');
?>
<html>
<head>
</head>
<body>
<?php
    $html = file_get_html('http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/');
    $table = $html->find('#snapshot_table table.listing');
    print_r($table);
?>
</body>
</html>
Mathew Hood
  • Would you like to perform the scraping/parsing of the website with jQuery? You will need a server-side proxy then, because you can't load something from another domain via AJAX. – Ewout Kleinsmann Jul 25 '11 at 00:34
  • Hey Ewout, I am really happy to use whatever method will be the most effective. The content of the table is really only updated 3 times a week at most, so it doesn't have to be updated asap. – Mathew Hood Jul 25 '11 at 00:48
  • This sounds like a job for PHP using CURL. – Tomm Jul 25 '11 at 00:51
  • Really? I would have thought that I could just use jQuery or a similar method to retrieve the data inside the div and then echo it onto my own website? Is it more complicated than that, or have I misstated the question? – Mathew Hood Jul 25 '11 at 00:54

2 Answers


I think I got it to work, and I learned a lot! :)

<?php
//Get the current timestamp
$url = 'http://www.epgpweb.com/api/snapshot/us/Caelestrasz/Crimson';
$url = file_get_contents($url);
$url = substr($url,-12,10); 

//Get the member data based on the timestamp
$url = 'http://www.epgpweb.com/api/snapshot/us/Caelestrasz/Crimson/'.$url;
$url = file_get_contents($url);

//Convert the unicode to html entities, as I found here: http://stackoverflow.com/questions/2934563/how-to-decode-unicode-escape-sequences-like-u00ed-to-proper-utf-8-encoded-char
function replace_unicode_escape_sequence($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}
$url = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $url);

//erase/replace the insignificant parts, to put the data into an array
function erase($a){
    global $url;
    $url = explode($a,$url);
    $url = implode("",$url);
}
function replace($a,$b){
    global $url;
    $url = explode($a,$url);
    $url = implode($b,$url);    
}
replace("[[",";");
replace("]]",";");
replace("],",";");
erase('[');
erase('"');
replace(":",",");
$url = explode(";", $url);

//lose the front and end bits, and maintain the member data
array_shift($url);
array_pop($url);

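//put the data into an array of rows: member, EP, GP, PR
$data = array();
foreach($url as $k=>$v){
    $v = explode(",",$v);
    foreach($v as $k2=>$v2){
        $data[$k][$k2] = $v2;
    }
    //PR = EP / GP; a GP of 0 gives a PR of 0 instead of a division-by-zero error
    $gp = intval($data[$k][2]);
    $pr = $gp != 0 ? round(intval($data[$k][1]) / $gp, 3) : 0;
    //format with three decimals (so a whole number like 2 becomes "2.000"), then keep 5 characters
    $pr = substr(number_format($pr, 3, '.', ''), 0, 5);
    $data[$k][3] = $pr;
}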

//sort the array by PR number
function compare($x, $y)
{
if ( $x[3] == $y[3] )
 return 0;
else if ( $x[3] > $y[3] )
 return -1;
else
 return 1;
}
usort($data, 'compare');

//output the data into a table
echo "<table><tbody><tr><th>Member</th><th>EP</th><th>GP</th><th>PR</th></tr>";
foreach($data as $k=>$v){
    echo "<tr>";
    foreach($v as $v2){ 
        echo "<td>".$v2."</td>";
    }
    echo "</tr>";
}
echo "</tbody></table>";
?>
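
Since the endpoint appears to return JSON, the string replacements above could probably be swapped for json_decode. What follows is only a minimal sketch under assumptions: the sample payload and its "roster" key are made up for illustration (the real response may be shaped differently), and build_standings and render_standings are hypothetical helper names. It keeps the same idea as the code above: compute PR as EP divided by GP, sort by PR from highest to lowest, and print a table.

```php
<?php
// Hypothetical sample payload; the real API response may use different keys.
$sampleJson = '{"roster": [["Alice", 300, 100], ["Bob", 500, 100], ["Carol", 50, 0]]}';

// Turns [name, EP, GP] rows into [name, EP, GP, PR] rows sorted by PR, highest first.
function build_standings($json, $rosterKey) {
    $decoded = json_decode($json, true);
    if (!is_array($decoded) || !isset($decoded[$rosterKey]) || !is_array($decoded[$rosterKey])) {
        return array();
    }
    $rows = array();
    foreach ($decoded[$rosterKey] as $row) {
        list($name, $ep, $gp) = $row;
        // PR = EP / GP; treat a GP of 0 as PR 0 to avoid dividing by zero.
        $pr = $gp != 0 ? round($ep / $gp, 3) : 0.0;
        $rows[] = array($name, (int) $ep, (int) $gp, $pr);
    }
    usort($rows, function ($a, $b) {
        if ($a[3] == $b[3]) {
            return 0;
        }
        return ($a[3] > $b[3]) ? -1 : 1;
    });
    return $rows;
}

// Renders the rows as an HTML table, escaping every cell.
function render_standings($rows) {
    $html = "<table><tbody><tr><th>Member</th><th>EP</th><th>GP</th><th>PR</th></tr>";
    foreach ($rows as $row) {
        $html .= "<tr>";
        foreach ($row as $i => $cell) {
            // Column 3 is the PR; show it with three decimals.
            $text = ($i === 3) ? number_format($cell, 3, '.', '') : (string) $cell;
            $html .= "<td>" . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</td>";
        }
        $html .= "</tr>";
    }
    return $html . "</tbody></table>";
}

echo render_standings(build_standings($sampleJson, 'roster'));
```

With the real data, $sampleJson would be replaced by the string returned from file_get_contents, and 'roster' by whatever key actually holds the member rows.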
bozdoz
  • Bozdoz, you, are, a, BEAST! Thank you so much! That is so much more than I ever expected! How long did it take you? 19020913 upvotes for you! – Mathew Hood Jul 25 '11 at 03:49
  • It took me a half hour I think. I didn't know a lot of it, but I like answering these questions as a way to learn for myself. It may not be the most beautiful code, but it works. At least you don't have to 'include' any other external code. Hope it all makes sense; I tried to be generous with my code comments, but it would have looked foreign to me just an hour ago. :) – bozdoz Jul 25 '11 at 03:54
  • Haha :) One last question before I do. The URL that Ewout used will become outdated once a new set of scores is posted up. I just had another look at the code on the original page (http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/). If you jump in Firebug and go to the DOM, you can see in gaGlobal that the sid is the number at the end of the URL. Would it be possible at all to grab that and store it as a variable, so the most recent sid set of results is loaded into the table? – Mathew Hood Jul 25 '11 at 03:56
  • I can pseudo code what I am trying to do but the actual code is far from my knowledge levels haha. You are a machine :)! – Mathew Hood Jul 25 '11 at 04:41
  • That was way harder than expected. I don't really know much about how to get session ids; luckily, I noticed that it was pulling the timestamp from another page, at http://www.epgpweb.com/api/snapshot/us/Caelestrasz/Crimson . – bozdoz Jul 25 '11 at 04:45
  • I found it in the XHR section in the NET tab of Firebug, btw. :) – bozdoz Jul 25 '11 at 04:51
  • Oh sorry, I didn't even notice that you had updated it! I was just refreshing the page waiting for you to post again! That is awesome :) Thank you so much :)! – Mathew Hood Jul 25 '11 at 05:38

Take a look at the PHP simple_html_dom class.

Then this will do the trick:

$html = file_get_html('http://www.epgpweb.com/guild/us/Caelestrasz/Crimson/');
$table = $html->find('#snapshot_table table.listing');
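
Note that find() called with only a selector returns an array of matching elements, so echoing $table directly prints the word Array. As a rough sketch of the same lookup without the external library, the snippet below swaps in PHP's built-in DOMDocument and DOMXPath classes. The sample markup and the helper name extract_listing_table are assumptions made up for illustration. Like any HTML parser, it only sees tables that are present in the HTML the server sends, so it will not pick up rows that are filled in afterwards by JavaScript.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$sampleHtml = '<html><body><div id="snapshot_table">'
    . '<table class="listing"><tr><th>Member</th><th>EP</th></tr>'
    . '<tr><td>Alice</td><td>120</td></tr></table></div></body></html>';

// Returns the HTML of the first table.listing inside #snapshot_table, or null if none is found.
function extract_listing_table($html) {
    $doc = new DOMDocument();
    // Collect parser warnings internally; real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector "#snapshot_table table.listing".
    $query = '//div[@id="snapshot_table"]'
        . '//table[contains(concat(" ", normalize-space(@class), " "), " listing ")]';
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Serialize just the matched node rather than the whole document.
    return $doc->saveHTML($nodes->item(0));
}

echo extract_listing_table($sampleHtml);
```

For a live page, $sampleHtml would be replaced by the string returned from file_get_contents on the page URL.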
Ewout Kleinsmann
  • This is similar to what I had looked at previously. I have added in echo $table; to try and print the list down the page, however it only returns the word Array. Is this an obvious mistake I am making or am I missing something? – Mathew Hood Jul 25 '11 at 01:12
  • Yes, it is ;-) try print_r($table) – Ewout Kleinsmann Jul 25 '11 at 01:15
  • Haha, I placed that in and pretty much it returned a tonne of code. http://testing.lifestyletrader.com/DOM/ Take a look for yourself! Haha – Mathew Hood Jul 25 '11 at 01:17
  • @Mathew Hood: could you place your code in your initial post? Something is obviously going wrong. Also try adding ->plaintext after ->find. And use echo again instead of print_r. – Ewout Kleinsmann Jul 25 '11 at 01:21
  • Updated original post, trying your suggestion now! – Mathew Hood Jul 25 '11 at 01:24
  • Changing it back to echo, and adding ->plaintext after ->find resulted in no information returned. – Mathew Hood Jul 25 '11 at 01:37
  • Hmm, strange... By the way, what is it you're trying to accomplish? If you're trying to capture the standings list on that website, you're going at it the wrong way. This list is dynamically loaded with AJAX. This link contains the list in JSON format: http://www.epgpweb.com/api/snapshot/us/Caelestrasz/Crimson/1311511740 – Ewout Kleinsmann Jul 25 '11 at 01:41
  • That is precisely what I am trying to do. I just want to show that same table of standings on a personal website as opposed to the one it is currently on. Is there a way to grab the list after it is compiled in AJAX? I am unfamiliar with JSON and how it would be implemented. The snapshot code that you just linked from the API is readily available for me to update the data on the website it is currently on. – Mathew Hood Jul 25 '11 at 01:48
  • There are ways to do that, but those are quite complex. I suggest you take a look at jQuery templates (Google it). They make it really easy to parse the JSON. – Ewout Kleinsmann Jul 25 '11 at 01:55
  • We clearly have different ideas of what is really easy :P I am taking a look now, thanks for all your help today! When I get more rep I will upvote you :)! – Mathew Hood Jul 25 '11 at 02:05