I am trying to put together a list of bios for names found on some websites.
I have names and their corresponding websites:
name        website
---------------------
John Doe    abc.com
Steve J     apple.com
For instance, I want to search for John Doe on http://abc.com
and fetch the URLs on that site where John Doe is found, e.g.:
http://abc.com/board/programmers.php
http://abc.com/team/list.php
http://abc.com/index/welcome.php
Of course I want to conform to robots.txt on each website. I am not data mining; I already know that person 'X' is associated with website 'Y', and I only want to list his or her bio. I am sure the website admin won't mind that!
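To show what I mean by conforming to robots.txt, here is a minimal sketch using Python's standard urllib.robotparser. The user agent name "bio-finder", the helper make_robots_checker, and the sample robots.txt content are placeholders I made up; a real run would download each site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="bio-finder"):
    """Build a url -> bool check from the text of a robots.txt file.

    "bio-finder" is a made-up user agent name used only for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content; not the real file of any site.
sample_robots = """User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(sample_robots)
print(allowed("http://abc.com/team/list.php"))
print(allowed("http://abc.com/private/staff.php"))
```

With the sample rules above, the first URL should be permitted and the second refused, so any page under /private/ would simply be skipped.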
I came across Scrapy, but I don't know the exact URLs where the name appears on a website. All I have is the root of the website, and I want the crawler to follow every linked page from there.
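To make the idea concrete, this is roughly the behaviour I am after, sketched with only the Python standard library rather than Scrapy. Everything named here (find_pages_with_name, LinkAndTextParser, fetch_html, the is_allowed hook, the max_pages limit) is my own placeholder, and the matching is a naive case-insensitive substring test, so treat it as an illustration and not a finished crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text nodes from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Note: this also picks up script/style content.
        self.text.append(data)


def fetch_html(url):
    """Default fetcher: download a page and decode it leniently."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_pages_with_name(root_url, name, fetch=fetch_html,
                         is_allowed=lambda url: True, max_pages=200):
    """Breadth-first crawl of one host; return URLs whose text contains name.

    is_allowed is a hook for a robots.txt check; by default it allows all.
    """
    host = urlparse(root_url).netloc
    queue = deque([root_url])
    seen = {root_url}
    matches = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        if not is_allowed(url):
            continue
        try:
            html = fetch(url)
        except OSError:
            continue  # skip pages that fail to download
        fetched += 1
        parser = LinkAndTextParser()
        parser.feed(html)
        # Collapse whitespace so "John   Doe" still matches "John Doe".
        page_text = " ".join(" ".join(parser.text).split()).lower()
        if name.lower() in page_text:
            matches.append(url)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # drop "#fragment"
            parts = urlparse(link)
            same_site = parts.scheme in ("http", "https") and parts.netloc == host
            if same_site and link not in seen:
                seen.add(link)
                queue.append(link)
    return matches
```

I would call it as find_pages_with_name("http://abc.com/", "John Doe") and expect back a list of page URLs on that host. A check built from the site's robots.txt could be passed in as is_allowed, and a polite delay between requests would probably be needed as well.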
While typing this I started wondering: why not key the search query and the website into Google and retrieve the results in an automated fashion? But I assume Google's ToS doesn't allow that.