I want to grab the list of players from http://www.atpworldtour.com/Rankings/Singles.aspx

There is a table with class "bioTableAlt". We have to grab every <tr> after the first one (class "bioTableHead"), which holds the table heading.

Wanted content looks like:

<tr class="oddRow">
 <td>2</td>
 <td>
  <a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx">Djokovic, Novak</a>
  (SRB)
 </td>
 <td>
  <a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx?t=rb">6,905</a>
 </td>
 <td>0</td>
 <td>
  <a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx?t=pa&m=s">21</a>
 </td>
</tr>
<tr>
 <td>3</td>
 <td>
  <a href="/Tennis/Players/Top-Players/Roger-Federer.aspx">Federer, Roger</a>
  (SUI)
 </td>
 <td>
  <a href="/Tennis/Players/Top-Players/Roger-Federer.aspx?t=rb">6,795</a>
 </td>
 <td>0</td>
 <td>
  <a href="/Tennis/Players/Top-Players/Roger-Federer.aspx?t=pa&m=s">21</a>
 </td>
</tr>

I think the best idea is to create an array(), make each <tr> a unique entry, and write the result to the file list.txt, like:

Array (
 [2] => stdClass Object (
    [name] => Djokovic, Novak
    [country] => SRB
    [rank] => 6,905
 )
 [3] => stdClass Object (
    [name] => Federer, Roger
    [country] => SUI
    [rank] => 6,795
 )
)

We're parsing each <tr>:

  • [2] is the number from the first <td>
  • [name] is the text of the link inside the second <td>
  • [country] is the value between (...) in the second <td>
  • [rank] is the text of the link inside the third <td>

The final file list.txt should contain an array() with ~100 IDs (we are grabbing the page with the first 100 players).

Additionally, it would be amazing to apply a small fix to each [name] before adding it to the array(): "Federer, Roger" should be converted to "Roger Federer" (just take the word before the comma and move it to the end of the line).
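
For illustration, the name fix could be sketched roughly as below. `flipName` is a hypothetical helper name, not part of any library, and the sketch assumes every name contains at most one meaningful comma separating surname from given name.

```php
<?php
// Hypothetical helper: turn "Last, First" into "First Last".
function flipName(string $name): string
{
    // Split on the first comma only, so the rest of the string stays intact.
    $parts = explode(',', $name, 2);
    if (count($parts) < 2) {
        return trim($name); // no comma found: leave the name unchanged
    }
    return trim($parts[1]) . ' ' . trim($parts[0]);
}

echo flipName('Federer, Roger'), "\n";  // Roger Federer
echo flipName('Djokovic, Novak'), "\n"; // Novak Djokovic
```

Splitting on the first comma rather than on whitespace keeps multi-word surnames and given names together.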

Thanks.

James
  • possible duplicate of [How to get string from HTML with regex?](http://stackoverflow.com/questions/3298293/how-to-get-string-from-html-with-regex) and [reqular expression problem in php](http://stackoverflow.com/questions/3382244/reqular-expression-problem-in-php/3382359#3382359) and [a couple others](http://stackoverflow.com/search?q=html+dom+php) - note that this is not to suggest you should use Regex, but to point at the suggested DOM solutions. – Gordon Aug 09 '10 at 13:38
  • @Gordon - this topic is very different – James Aug 09 '10 at 13:39
  • no, it is not different. You are asking how to fetch a specific node or nodeset from a webpage. That is done with a DOM parser and XPath, and there are plenty of examples in the three links above. The only thing they won't tell you is how to apply the name fix you are asking for. – Gordon Aug 09 '10 at 13:42
  • @Gordon maybe, my php knowledge isn't good. – James Aug 09 '10 at 13:48
  • that's okay. that's why I am telling you. But that's no reason to spoonfeed you the answer. – Gordon Aug 09 '10 at 13:52

2 Answers

Below is how to do it with PHP's native DOM extension. It should get you halfway to where you want to go.

The page is quite broken in terms of HTML validity and that makes loading with DOM somewhat tricky. Normally, you can use load() to load a page directly. But since the HTML is quite broken, I loaded the page into a string first and used the loadHTML method instead, because it handles broken HTML better.

Also, there is only one table on that page: the ranking table. The scoreboards are loaded via Ajax once the page has loaded, so their HTML will not show up in the source code when you load it with PHP. So you can simply grab all TR elements and iterate over them.

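// Collect parse warnings internally instead of printing them; the page's HTML is invalid.
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
// Fetch the page into a string first, then parse it with the lenient HTML parser.
$dom->loadHTML(
    file_get_contents('http://www.atpworldtour.com/Rankings/Singles.aspx'));
libxml_clear_errors();

// The ranking table is the only table in the source, so every TR belongs to it.
$rows = $dom->getElementsByTagName('tr');
foreach($rows as $row) {
    foreach( $row->childNodes as $cell) {
        // Print the trimmed text of each child node (cells and whitespace text nodes alike).
        echo trim($cell->nodeValue);
    }
}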

This would output all table cell contents. It should be trivial to add those to an array and/or to write them to file.
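
As a rough sketch of that last step, the rows could be collected with DOMXPath and written out with print_r(). To stay runnable without network access, the example parses an inline fragment built from the rows in the question instead of the live page; the heading row content and the variable names (`$players`, `$query`) are placeholders chosen for illustration, and the real page's markup may differ.

```php
<?php
// Inline sample based on the rows in the question; the heading row content is a placeholder.
$html = <<<HTML
<table class="bioTableAlt">
<tr class="bioTableHead"><td>heading</td></tr>
<tr class="oddRow">
 <td>2</td>
 <td><a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx">Djokovic, Novak</a> (SRB)</td>
 <td><a href="/Tennis/Players/Top-Players/Novak-Djokovic.aspx?t=rb">6,905</a></td>
</tr>
<tr>
 <td>3</td>
 <td><a href="/Tennis/Players/Top-Players/Roger-Federer.aspx">Federer, Roger</a> (SUI)</td>
 <td><a href="/Tennis/Players/Top-Players/Roger-Federer.aspx?t=rb">6,795</a></td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$players = array();

// Every row of the ranking table except the heading row.
$query = '//table[contains(@class, "bioTableAlt")]'
       . '//tr[not(contains(@class, "bioTableHead"))]';

foreach ($xpath->query($query) as $row) {
    $cells = $xpath->query('td', $row);
    if ($cells->length < 3) {
        continue; // not a player row
    }

    // The first cell holds the position, used as the array key.
    $position = (int) trim($cells->item(0)->nodeValue);

    $player = new stdClass;
    // Name: text of the link inside the second cell.
    $nameLink = $xpath->query('a', $cells->item(1))->item(0);
    $player->name = $nameLink ? trim($nameLink->nodeValue) : '';
    // Country: the three-letter code between parentheses in the second cell.
    $player->country = preg_match('/\(([A-Z]{3})\)/', $cells->item(1)->nodeValue, $m)
        ? $m[1]
        : '';
    // Rank: text of the third cell, which contains only the link.
    $player->rank = trim($cells->item(2)->nodeValue);

    $players[$position] = $player;
}

// Write the array in print_r() format, as shown in the question.
file_put_contents('list.txt', print_r($players, true));
```

For the live page, `$html` would instead come from file_get_contents() as in the snippet above. Any reordering of the name would be applied just before it is assigned to `$player->name`.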

Gordon

SimpleHTMLDOM will make this very easy for you.

The first few lines would look something like this (untested):

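// Load the library (assumes simple_html_dom.php sits next to this script)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.atpworldtour.com/Rankings/Singles.aspx');

// Find all rows of the ranking table except the heading row
// (the question says "bioTableAlt" is a class, so select by class rather than id)
foreach($html->find('table[class=bioTableAlt] tr[class!=bioTableHead]') as $element) 
    {
        // process each player row here
    }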

(I'm not sure about the tr[class!=bioTableHead] part; if it doesn't work, try a plain tr and skip the heading row yourself.)

Pekka
  • Will try, actually I want only text and no images. – James Aug 09 '10 at 13:32
  • Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Aug 09 '10 at 13:32
  • @Gordon you totally have a point, as always. Haven't looked at phpQuery before, that one looks like it could become my new favourite :) – Pekka Aug 09 '10 at 13:34
  • please tell me how to select specific elements with SimpleHTMLDOM, like :nth-child(1) a {} – James Aug 09 '10 at 13:37
  • @Ignatz see http://simplehtmldom.sourceforge.net/manual.htm "How to traverse the DOM tree?" – Pekka Aug 09 '10 at 13:39