
Possible Duplicate:
How to parse and process HTML with PHP?

How do I go about pulling specific content from a given live online HTML page?

For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967

I want to retrieve the text description, the path to the main image and the price only. So basically, I want to retrieve content that is inside specific divs, with specific IDs or classes, inside an HTML page.

Pseudocode:

$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);

Please note I do not intend to steal any content from gumtree, or anywhere else for that matter, I am just providing an example.
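The pseudocode above maps fairly directly onto PHP's built-in DOM extension. Here is a minimal sketch using DOMDocument and DOMXPath; the HTML snippet and the ids/classes in it (`title`, `price`, `description`, `main-image`) are made up for illustration, since the real page's markup will differ. For a live page you would replace the inline string with `file_get_contents($url)`.

```php
<?php
// Stand-in for the fetched page; the real listing's ids/classes will differ.
$html = '
<div id="ad">
  <h1 id="title">Ovation semi acoustic guitar</h1>
  <span class="price">GBP 150</span>
  <div class="description">Lovely guitar, barely used.</div>
  <img id="main-image" src="/images/guitar.jpg" alt="">
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$title       = trim($xpath->query('//h1[@id="title"]')->item(0)->textContent);
$price       = trim($xpath->query('//span[@class="price"]')->item(0)->textContent);
$description = trim($xpath->query('//div[@class="description"]')->item(0)->textContent);
$image       = $xpath->query('//img[@id="main-image"]')->item(0)->getAttribute('src');

echo "$title | $price | $image\n";
```

Suppressing libxml errors matters here: without it, loadHTML() emits warnings on every malformed tag a real page contains.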

emkay

3 Answers


First of all, what you want to do is called web scraping. Basically, you load the HTML content into a variable, and then you can use regexps to search for specific ids, etc. Search for "web scraping".

Here is a basic tutorial.

This book should be useful too.
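As a concrete sketch of the regexp approach described above: load the page into a string, then match the element you want. The `id="price"` and the inline HTML here are invented for illustration; note that regular expressions are fragile against markup changes, and a DOM parser is generally more robust.

```php
<?php
// Stand-in for a fetched page; for a live page you would use
// $page = file_get_contents('http://example.com/listing');
$page = '<html><body><span id="price">150</span></body></html>';

// Non-greedy match on the contents of the span with the id we care about.
if (preg_match('|<span id="price">(.*?)</span>|', $page, $m)) {
    $price = $m[1];
    echo $price, "\n";   // 150
}
```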

p1100i

Something like this would be a good starting point if you wanted tabular output:

$raw = file_get_contents($url) or die('could not fetch page');
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", "<br/>");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content, '<some id>');
$end = strpos($content, '</ending id>');
$table = substr($content, $start, $end - $start);
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    if (strpos($row, '<th') === false) {
        // array to vars
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $var1 = strip_tags($cells[0][0]);
        $var2 = strip_tags($cells[0][1]);
        // ...and so on for the remaining cells
    }
}

  • file_get_contents might need to be replaced with curl; file_get_contents has been disabled on my shared hosting account, for example, but it does work on localhost –  Jan 04 '12 at 13:49

The tutorial Easy web scraping with PHP recommended by robotrobert is good to start with; I have made several comments on it. For better performance use cURL. Among other things it handles HTTP headers, SSL, cookies, proxies, etc. Cookies are something you must pay attention to.

I just found HTML Parsing and Screen Scraping with the Simple HTML DOM Library. It is more advanced, and it facilitates and speeds up page parsing through a DOM parser (instead of regular expressions, which are hard enough to master and resource-consuming). I recommend this last one 100%.
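With the Simple HTML DOM library, the original question's three lookups become one-liners. A sketch, assuming you have downloaded simple_html_dom.php; the CSS selectors (`span.price`, `div.description`, `img.main-image`) are hypothetical, since the real listing page's classes will differ.

```php
<?php
// Requires simple_html_dom.php from the Simple HTML DOM project.
include 'simple_html_dom.php';

$html = file_get_html('http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967');

// Selector names below are made up for illustration.
$price       = $html->find('span.price', 0)->plaintext;
$description = $html->find('div.description', 0)->plaintext;
$image       = $html->find('img.main-image', 0)->src;
```

The `find()` method takes CSS-like selectors, so there is no regexp to maintain when the markup shifts slightly.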

Igor Parra