a. I am using a simple URL crawler (from "How do I make a simple crawler in PHP?") on xyz.com/items/advsearch. The page lists all the results of an advanced search, and I need to scrape them. The URL of the "Next>" link is encoded, and when my crawler requests it, I am sent back to the main Advanced Search page, which shows 0 results.

b. I also noticed that when I browse the site manually, the URL of "Next>" does not contain jsessionid as a parameter, but when I fetch the page's HTML with file_get_contents(), it does. Why is that?

I am finding it difficult to deal with the encoded URLs and session handling, and as a result I cannot crawl the site. Urgent help needed.

UserBSS1
  • Does your crawling method handle cookies? (If it's a directed scan and sessions without side-effects, then enable that.) – mario Dec 29 '11 at 13:38
  • No, it does not handle cookies. But I did try the existing PHPCrawler 0.7 library (free), which does handle cookies. The problem lies with the jsessionid: this crawler also extracts URLs that have the session ID encoded in them. Even if I strip the jsessionid from the URL, the page still takes me back to the home page. – UserBSS1 Dec 29 '11 at 14:16

1 Answer

The jsessionid is usually stored and sent in a cookie. Adding it to link URLs is only a fallback that the Java application uses when it notices that the client may not support cookies. That is probably why the jsessionid parameter is not added to the URLs when you traverse the pages with a web browser: your browser handles cookies properly, while your PHP script does not.
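As a rough illustration of what "handling cookies" could look like when staying with file_get_contents(), here is a minimal sketch. The helper names (extractCookies, buildCookieHeader, fetchWithCookies) are made up for this example and are not part of PHP or any library. It assumes the session cookie arrives in an ordinary Set-Cookie header and ignores cookie attributes such as Path, Domain and expiry, so treat it as a starting point rather than a complete cookie jar.

```php
<?php
// Hypothetical helpers: a very small cookie store for file_get_contents().

/** Collect name=value pairs from Set-Cookie response headers. */
function extractCookies(array $responseHeaders, array $cookies = []): array
{
    foreach ($responseHeaders as $header) {
        if (stripos($header, 'Set-Cookie:') !== 0) {
            continue; // not a cookie header
        }
        // Keep only "name=value"; attributes (Path, HttpOnly, ...) are ignored.
        $pair = trim(explode(';', substr($header, strlen('Set-Cookie:')), 2)[0]);
        if (strpos($pair, '=') === false) {
            continue;
        }
        [$name, $value] = explode('=', $pair, 2);
        $cookies[trim($name)] = trim($value);
    }
    return $cookies;
}

/** Build the value of a Cookie request header from stored cookies. */
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

/** Fetch a URL, sending stored cookies and remembering any new ones. */
function fetchWithCookies(string $url, array &$cookies)
{
    $http = ['method' => 'GET'];
    if ($cookies) {
        $http['header'] = 'Cookie: ' . buildCookieHeader($cookies) . "\r\n";
    }
    $html = file_get_contents($url, false, stream_context_create(['http' => $http]));
    // The HTTP wrapper fills $http_response_header in the local scope.
    $cookies = extractCookies($http_response_header ?? [], $cookies);
    return $html;
}

// Possible usage ($nextUrl stands for the "Next>" link taken from the first page):
// $cookies = [];
// $first = fetchWithCookies('http://xyz.com/items/advsearch', $cookies);
// $next  = fetchWithCookies($nextUrl, $cookies);
```

Because the same $cookies array is passed to every request, the server should see one continuous session, much as it does with a browser. If PHP's cURL extension is available, its CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options provide the same behaviour with less code.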

Jan