
Lately I have been doing some work on web scraping, and after some research and analysis I have got the hang of it. However, I am stuck on one point that I cannot find a suitable answer to, even after googling. My scraper logs into an intranet page with a username and password. For a given URL my code can fetch the data, but when the URL changes the code fails to log in because it hits the wrong URL. The code that requests the link is a kind of agent that hits the URL on a refresh command.

I would like to know of a good tool or book that can help me understand how to apply artificial intelligence to web scraping, so that I can handle my agents dynamically without reconfiguring them manually. Any help would be greatly appreciated.

David Storey
chaosguru

1 Answer


If the links change often, you could read the response headers sent from the old link and check whether they contain a redirect to the new link. The HTTP redirect status codes (3xx) are described here:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.3

I don't know what software you are using for scraping, but I'm sure it can follow redirects.

For example, with cURL in PHP, the following option is used to follow redirects:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//FROM http://stackoverflow.com/questions/3519939/make-curl-follow-redirects
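If your scraping tool does not follow redirects for you, the header check described above can be done by hand. The following is only a minimal Python sketch of that idea, not a definitive implementation. The names `redirect_target` and `follow_redirects`, the `fetch` callback (any function returning a status code and a header dict for a URL), the hop limit, and the `intranet.example` URLs are all assumptions made for illustration. Status code 308 comes from a later RFC than the one linked above.

```python
from urllib.parse import urljoin

# Redirect status codes that carry a Location header (308 is newer than RFC 2616).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(status, headers, current_url):
    """Return the absolute URL a redirect response points to, or None."""
    if status not in REDIRECT_CODES:
        return None
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    location = lowered.get("location")
    if not location:
        return None
    # Location may be relative; resolve it against the URL that was requested.
    return urljoin(current_url, location)


def follow_redirects(url, fetch, max_hops=5):
    """Follow up to max_hops redirects using fetch(url) -> (status, headers)."""
    seen = {url}
    for _ in range(max_hops + 1):
        status, headers = fetch(url)
        target = redirect_target(status, headers, url)
        if target is None:
            return url  # not a redirect: this is the URL to log in against
        if target in seen:
            raise RuntimeError(f"redirect loop at {target}")
        seen.add(target)
        url = target
    raise RuntimeError(f"gave up after more than {max_hops} redirects")


if __name__ == "__main__":
    # Hypothetical responses standing in for a real HTTP client.
    fake_site = {
        "http://intranet.example/login": (301, {"Location": "/v2/login"}),
        "http://intranet.example/v2/login": (200, {}),
    }
    print(follow_redirects("http://intranet.example/login", fake_site.__getitem__))
```

In a real agent, `fetch` would wrap whatever HTTP client you already use, with its own redirect following turned off. The hop limit and loop check guard against pages that redirect to each other forever.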

To answer your request:

I would like to know any good Tool or some book which can help me to understand on Applying artificial intelligence on Web scraping

PHP is a good tool for learning basic web scraping, but it's not as fast as you might imagine. The fastest technology I know of for this is Erlang, but it's not that friendly to newcomers.