0

How can I screen scrape a website using cURL and show the data within a specific div?

user272899

4 Answers

6

Download the page using cURL (there are plenty of examples in the documentation). Then use a DOM parser, for example Simple HTML DOM or PHP's DOM extension, to extract the value from the div element.
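As a minimal sketch of those two steps, assuming PHP with the cURL and DOM extensions enabled: the function names `fetchPage` and `divTextById`, the example URL, and the id `content` are all hypothetical placeholders, not anything taken from the question.

```php
<?php
// Fetch a page with cURL and return its body as a string.
function fetchPage(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
    ]);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('cURL failed: ' . curl_error($ch));
    }
    return $html;
}

// Return the text content of the first div whose id matches, or null if none does.
function divTextById(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') === $id) {
            return trim($div->textContent);
        }
    }
    return null;
}

// Usage (placeholder URL and id):
// echo divTextById(fetchPage('http://www.example.com/'), 'content');
```

Keeping the download and the parsing in separate functions makes it possible to try the parsing step on a saved HTML string before pointing it at a live site.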

Yacoby
    Please comment when downvoting to give me the chance to correct or otherwise improve my answer. – Yacoby Mar 26 '10 at 16:38
0

After downloading the page with cURL, use XPath to select the div and extract its content.
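A rough illustration of the XPath step, assuming the page has already been downloaded into a string and that PHP's DOM extension is available. The helper name `selectInnerHtml` and the class name `news` in the usage comment are made up for the example.

```php
<?php
// Return the inner HTML of the first node matching an XPath query, or null.
function selectInnerHtml(string $html, string $query): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize each child of the matched node to get its inner HTML.
    $inner = '';
    foreach ($nodes->item(0)->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Usage (hypothetical class name):
// echo selectInnerHtml($html, '//div[@class="news"]');
```

XPath is handy when the div has no id: the query can select by class, by position such as `(//div)[3]`, or by contained text.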

hoju
-1

Fetch the website content using a cURL GET request. There's a code sample on the curl_exec manual page.

Use a regular expression to search for the data you need. There's a code sample on the preg_match manual page, but you'll need to read up on regular expressions to build the pattern you need. As Yacoby mentioned (and I hadn't thought of it), a better idea may be to examine the DOM of the HTML page using PHP's SimpleXML or DOM parser.

Output the information you've found with the regex or parser in the HTML of your page (within the required div).
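The regex route and the output step could look roughly like the sketch below. It assumes the target div is identified by an id and contains no nested divs, since the pattern stops at the first closing tag; that limitation is one reason a DOM parser is usually the safer choice. The function names and the `scraped` class are hypothetical.

```php
<?php
// Extract the contents of <div id="..."> with a regular expression.
// Fragile by design: it stops at the first </div>, so nested divs break it.
function divByIdRegex(string $html, string $id): ?string
{
    $pattern = '~<div\b[^>]*\bid\s*=\s*(["\'])' . preg_quote($id, '~')
             . '\1[^>]*>(.*?)</div>~is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[2]);
    }
    return null;
}

// Wrap the extracted text in a div on your own page, escaping it first
// so that scraped content cannot inject markup.
function renderInDiv(string $text): string
{
    return '<div class="scraped">'
         . htmlspecialchars($text, ENT_QUOTES, 'UTF-8')
         . '</div>';
}

// Usage (hypothetical id):
// $value = divByIdRegex($html, 'price');
// echo $value === null ? 'Not found' : renderInDiv($value);
```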

Andy Shellam
    The <center> cannot hold ... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Tim Post Apr 06 '10 at 17:43
-1

A possible alternative.

# We will store the web page in a string variable.
var string page

# Read the page into the string variable.
cat "http://www.abczyx.com/path/to/page.ext" > $page

# Output the portion in the third (3rd) instance of "<div...</div>"
stex -r -c "^<div&</div\>^3" $page

This code is in biterscripting. The 3 is just a sample value that extracts the third div. If you want to extract the div that contains, say, the string "ABC", use this command syntax instead.

stex -r -c "^<div&ABC&</div\>^" $page

Take a look at this script: http://www.biterscripting.com/helppages/SS_ExtractTable.html. It shows how to extract an element (div, table, frame, etc.) when the elements are nested.

P M