
How can I screen scrape a particular website? I need to log in to the site and then scrape the information that is only available after logging in. How could this be done?

Please guide me.

Duplicate: How to implement a web scraper in PHP?

praveenjayapal
  • Yes, a duplicate. But this one goes more into accessing sites that require authentication. – Ross Feb 06 '09 at 13:13

6 Answers

1
Zend_Http_Client and Zend_Dom_Query
0

Use curl to log in, and once you're in, use the QueryPath PHP library (querypath.org). It lets you access DOM elements just like in jQuery, via CSS selectors, and it supports method chaining.

It's way better than just using PHP's native XML functions.

It also works as a Drupal extension, but I suppose you could use it in any PHP project.

tonino.j
0

You want to look at the curl functions: they will let you get a page from another website. Depending on the site you're logging in to, you can use cookies or HTTP authentication to log in first, then get the page you want.
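As a rough sketch of the cookie-based approach, the login can be a form POST that stores the session cookie, followed by a GET on the same curl handle. The URLs, form field names, and cookie file path below are hypothetical placeholders, and `buildLoginOptions` and `fetchAfterLogin` are illustrative helper names, not part of any library. Real sites often need extra steps (hidden form fields, CSRF tokens) that this does not cover.

```php
<?php
// Builds the curl options for a form-based login.
// All argument values are supplied by the caller; nothing here is site-specific.
function buildLoginOptions(string $loginUrl, array $fields, string $cookieJar): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // e.g. username=...&password=...
        CURLOPT_COOKIEJAR      => $cookieJar, // cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieJar, // enables cookie handling and reads saved cookies
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow the redirect many login forms issue
    ];
}

// Logs in, then fetches $pageUrl with the same handle so the session cookie is reused.
// Returns the page HTML as a string, or false if either request fails.
function fetchAfterLogin(string $loginUrl, array $fields, string $pageUrl, string $cookieJar)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildLoginOptions($loginUrl, $fields, $cookieJar));

    if (curl_exec($ch) === false) {
        return false; // login request failed at the transport level
    }

    // Switch the handle back to a plain GET for the protected page.
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    curl_setopt($ch, CURLOPT_URL, $pageUrl);

    // The handle is released when it goes out of scope at the end of the function.
    return curl_exec($ch);
}

// Hypothetical usage (placeholder URLs and field names):
// $html = fetchAfterLogin(
//     'https://example.com/login',
//     ['username' => 'alice', 'password' => 'secret'],
//     'https://example.com/account',
//     sys_get_temp_dir() . '/cookies.txt'
// );
```

For a site that uses HTTP authentication instead of a login form, setting `CURLOPT_USERPWD` to `"user:password"` on the request for the protected page is usually enough, and no cookie file is needed.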

Once you have the page, you're probably best off using regular expressions to scrape the data you want.
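For a single, well-delimited value, a regex extraction might look like the sketch below. The sample HTML and the `extractTitle` helper name are made up for illustration. This only holds up for small, predictable fragments; nested or irregular markup is much harder to match reliably this way.

```php
<?php
// Returns the text of the first <title> element, or null if there is none.
function extractTitle(string $html): ?string
{
    // i: ignore tag case, s: let "." match newlines, .*? : stop at the first closing tag
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches)) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($matches[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

// Made-up sample page standing in for the HTML returned by curl.
$html = "<html><head><TITLE>\n  Account &amp; Billing </TITLE></head><body></body></html>";
$title = extractTitle($html);
```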

Greg
  • -1 Sorry but this issue has come up time and time again: regex is a terrible way to do scraping. Use an HTML/XML parser. Regexes are so error-prone for this sort of thing it's not funny. – cletus Feb 06 '09 at 23:10
  • @cletus I completely disagree. If you're looking to get a small piece of information from a blob of HTML, a regex is the way to go. – Greg Feb 06 '09 at 23:43
0

You should look at curl.

benlumley
0

You might also want to take a look at BeautifulSoup, a Python library that is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.

I don't know how easy it would be to call from PHP, though.

andynormancx
  • -1 Beautiful Soup is fine if it's Python but this isn't. There are PHP libraries (like Zend and Simple XML) for this. Calling Python is not a sensible solution. – cletus Feb 06 '09 at 23:11
  • Seems a little harsh. I don't know that much about Simple XML and Zend, but Googling suggests SimpleXML is just an XML parser and Zend is an app server. I fail to see how either of those helps in any specific way with the hard problem of scraping HTML in the way that something like BS would. – andynormancx Feb 06 '09 at 23:54
  • Zend is also a framework of many different packages. And that's kinda my point: your knowledge of PHP is sketchy (it seems) so suggesting Python (something I presume you know more about based on your answer) doesn't really help. – cletus Feb 07 '09 at 00:36
  • So Zend has a package designed for parsing badly formatted HTML as found on most websites, then? If it has, nobody seems to have recommended it here. Is there such a package? – andynormancx Feb 07 '09 at 09:05
  • I know enough about PHP to know that it can shell out to another app. So running a quick Python script that uses BS to make the HTML parseable should work. If I were looking at scraping potentially lousy HTML, it is definitely what I would try first, before attempting to roll my own. – andynormancx Feb 07 '09 at 09:10
0

You could also check out http://php.net/dom
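A minimal sketch of that route, assuming the page HTML has already been fetched into a string: load it with `DOMDocument` and query it with `DOMXPath`. The sample markup and the `extractLinks` helper name are invented for illustration.

```php
<?php
// Returns every link in the document as ['href' => ..., 'text' => ...].
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Scraped HTML is often malformed, so collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    // Select only anchors that actually carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Made-up fragment with an unclosed <p>, standing in for a scraped page.
$html = '<div><p>Unclosed paragraph<a href="/orders">Orders</a>'
      . '<a href="/profile"> Profile </a><a>no href</a></div>';
$links = extractLinks($html);
```

The parser tolerates the unclosed tag here, so this covers a fair amount of imperfect markup without any third-party library.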

middus