1

I've been trying to simply extract "the next episode number" from a TV episodes tracking website. Here's an example page:

Example page

Scroll down and you'll see "Countdown", "Date", "Season" and "number". I'd like to extract that number.

I've been looking at the source code as well as Simple HTML DOM to try and work something out but I failed multiple times. The "number" has the class "nextEpInfo" but the "Countdown", "season"...etc have the same class as well.

How would I go about extracting it?

Also if possible I'd really appreciate some good references that explain the method that you recommend as I'd ideally like to learn how to deal with these situations in the future when content I need extracted is wrapped inside different classes, divs...etc.

j0k
user1788210
  • Besides matching the attribute (i.e. the CSS class), you would need to match the text. Here is a related question: http://stackoverflow.com/questions/3655549/xpath-containstext-some-string-doesnt-work-when-used-with-node-with-more – ajreal Nov 05 '12 at 10:45
  • @ajreal Thanks for your post. However, I couldn't really tie things together since I was attempting to extract the number via Simple HTML Dom http://simplehtmldom.sourceforge.net/ so it was really hard for me to understand the answer you referenced. Is it possible to provide sample code if you have the time? Thanks! – user1788210 Nov 05 '12 at 11:22

4 Answers

1

If you have the raw HTML of the page you want to parse, you can use preg_match() to find it.

If you don't have the HTML this should help you: How do I get the HTML code of a web page in PHP?

preg_match()

This function matches a string against a regular expression pattern. It is better to run it on only a fragment of the HTML rather than the whole page. For example, in this case I would try to get the HTML of the first table (the one that doesn't have info about the previous episode).

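$subject = "the HTML of the url you want to parse";
// \s* tolerates the padding and line break between the two cells
$pattern = '/Number:\s*<\/td>\s*<td.+?>(\d+)<\//';
if (preg_match($pattern, $subject, $hits)) {
    // $hits[0] is the whole match; $hits[1] is the captured number
    echo "Number: $hits[1]";
}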

In case you don't know how a regular expression works:

'.' is a reserved character that means 'any character' (except a newline, by default), the '+' right after it means 'one or more', and the '?' makes the quantifier non-greedy. So, summing up, '.+?' means 'one or more of any character, but as few as possible'.

'(' and ')' indicate that we want to capture what is between them, and '\d' means a digit. So '(\d+)' means 'capture that run of digits'; it ends up in $hits[1], while $hits[0] holds the whole matched text.

If you use the same regular expression with preg_match_all, you retrieve every number on the page that follows that same pattern; they will all be inside the $hits array.
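To make the difference between the two functions concrete, here is a minimal sketch. The $subject markup is a made-up fragment modelled on the row quoted elsewhere in this thread (the episode number 7 is invented), and the generalised pattern with a (\w+) label group is my own assumption, not something taken from the site.

```php
<?php
// Hypothetical fragment; the real page may differ.
$subject = <<<HTML
<tr class="nextEpInfo">
<td class="nextEpInfo">Season:    </td>
<td class="nextEpInfo" width="300">4</td>
</tr>
<tr class="nextEpInfo">
<td class="nextEpInfo">Number:    </td>
<td class="nextEpInfo" width="300">7</td>
</tr>
HTML;

// preg_match stops at the first match: $hits[0] is the full match,
// $hits[1] is the first capture group.
$number = null;
if (preg_match('/Number:\s*<\/td>\s*<td.+?>(\d+)<\//', $subject, $hits)) {
    $number = $hits[1];
}

// preg_match_all collects every match. With PREG_SET_ORDER each entry
// is one match: [0] full text, [1] label, [2] digits.
$info = [];
preg_match_all('/(\w+):\s*<\/td>\s*<td.+?>(\d+)<\//', $subject, $all, PREG_SET_ORDER);
foreach ($all as $m) {
    $info[$m[1]] = (int) $m[2];
}

echo "Number: $number\n";
print_r($info);
```

Keying the results by label avoids relying on the order in which the rows appear.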

Community
Naryl
  • Thanks a lot for the tip. After around 5 hours working on this, I managed to accomplish it using the Simple HTML DOM without any complications. I used a combination of CURL + HTML DOM. I'll post what I did for others to take a look if they wish. I had an important question though. I know how to retrieve a page through file_get_contents() and curl, but I have no idea how to retrieve "part of it" for the purpose. Appreciate any help on that. – user1788210 Nov 05 '12 at 17:16
  • Well, you could use any XML or DOM parser to get the part you want with XPath, or you could use a regular expression just like before, for example '/(.+?Number:.+?<\/table>)/'. This should give you all the tables with a 'Number:' inside any of their td/th cells. – Naryl Nov 06 '12 at 09:05
0

This can be done using XPath:

(//td[contains(text(), 'Number')])[1]/../td[2]

This query navigates to the first td whose text contains Number. It then goes to the parent node (/../) of that td, and then to the second td (td[2]) of that row, which contains the next episode number.

Firebug allows you to test XPath queries in the console, using $x:

$x("(//td[contains(text(), 'Number')])[1]/../td[2]");

To use this with PHP, check out DOMDocument and DOMXPath. More specifically, DOMDocument::loadHTML and DOMXPath::query.
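As a rough sketch of how those two calls fit together, assuming the page markup looks like the hypothetical fragment below (the value 7 is made up, and a real page would be fetched first, e.g. with cURL or file_get_contents()):

```php
<?php
// Hypothetical fragment standing in for the downloaded page.
$html = <<<HTML
<table>
<tr class="nextEpInfo">
<td class="nextEpInfo">Season:    </td>
<td class="nextEpInfo">4</td>
</tr>
<tr class="nextEpInfo">
<td class="nextEpInfo">Number:    </td>
<td class="nextEpInfo">7</td>
</tr>
</table>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings
// internally instead of letting them be printed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Same query as above: first td containing 'Number', up to its row,
// then the row's second td.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("(//td[contains(text(), 'Number')])[1]/../td[2]");

// query() returns a DOMNodeList; guard against an empty result.
$number = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo "Number: $number\n";
```

Checking length before calling item(0) keeps the script from failing with a null dereference if the site's markup changes.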

alexn
0

Below is some pseudocode, based on Simple HTML DOM, that you can use:

1) Retrieve all the tr elements with the class nextEpInfo:

foreach($html->find('tr.nextEpInfo') as $tr)

2) For each tr, check whether it contains the keyword you are after with stristr. Example for the episode number: if(stristr($tr, 'Number:') !== FALSE)

3) If it does, fetch the two tds under that tr: $tds = $tr->find('td')

4) Get the desired value from the second td: $tds[1]->plaintext
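The four steps above can be sketched as runnable code. To keep the example self-contained, this version swaps Simple HTML DOM for PHP's built-in DOM extension, so the calls differ from the ones listed; the function name findRowValue and the sample markup (including the value 7) are hypothetical.

```php
<?php
// Returns the text of the second td in the first nextEpInfo row whose
// text contains $label, or null if no such row exists.
function findRowValue(string $html, string $label): ?string
{
    libxml_use_internal_errors(true); // tolerate invalid real-world HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('tr') as $tr) {
        // Step 1: only consider rows carrying the nextEpInfo class.
        if ($tr->getAttribute('class') !== 'nextEpInfo') {
            continue;
        }
        // Step 2: case-insensitive check for the keyword in the row.
        if (stripos($tr->textContent, $label) === false) {
            continue;
        }
        // Steps 3 and 4: take the cells and return the second one.
        $tds = $tr->getElementsByTagName('td');
        if ($tds->length >= 2) {
            return trim($tds->item(1)->textContent);
        }
    }
    return null;
}

// Hypothetical fragment modelled on the page's markup.
$sample = <<<HTML
<table>
<tr class="nextEpInfo"><td class="nextEpInfo">Season:    </td><td class="nextEpInfo">4</td></tr>
<tr class="nextEpInfo"><td class="nextEpInfo">Number:    </td><td class="nextEpInfo">7</td></tr>
</table>
HTML;

$season  = findRowValue($sample, 'Season:');
$number  = findRowValue($sample, 'Number:');
$missing = findRowValue($sample, 'Rating:');
echo "Season $season, episode $number\n";
```

Passing the label as a parameter means the same function can pull out the season, the date, or the countdown as well.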

Ngo Hung
0
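<?php
/*
Sample of the markup being matched:

<tr class="nextEpInfo">
<td width="160" align="right" nowrap="" class="nextEpInfo">Season:    </td>
<td class="nextEpInfo" width="300">4</td>
</tr>
*/
$url = 'http://next-episode.net/the-good-wife';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_ENCODING, '');         // empty string: accept any encoding cURL supports
curl_setopt($ch, CURLOPT_REFERER, $url);
$content = curl_exec($ch);
if ($content === false) {
    exit("Request failed: " . curl_error($ch) . "\n");
}
//echo $content;
$matches = array();
// Group 1: the label before the colon; group 2: the digits in the next cell.
// [^<]* tolerates the padding between the colon and </td>.
preg_match_all('/class="nextEpInfo">([^<:]+):[^<]*<\/td>\s*<td[^>]*>(\d+)</', $content, $matches);
print_r($matches);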

Or something similar. This is the simplest approach, and it will work as long as the site's owner doesn't change the strings. Using XPath or another XML/HTML parser could be overkill for matching two strings, and it can break in the same way if the content on the site changes.

Michael Tabolsky