
I am trying to put together a list of bios for names found on some websites.

I have names and their corresponding websites:

name      website
-----------------
John Doe  abc.com
Steve J   apple.com

For instance, I want to search for John Doe at http://abc.com.

I want to fetch the URLs on that site where John Doe is found:

For example:

http://abc.com/board/programmers.php
http://abc.com/team/list.php
http://abc.com/index/welcome.php

Of course I want to conform to robots.txt on each website. I am not data mining; I already know that a person 'X' is associated with a website 'Y', and I want to list his bio. I am sure the website admin won't mind that!
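(As a side note, Python's standard library can do the robots.txt check for you. A minimal sketch, assuming the placeholder site abc.com from my table; the rules are parsed from a string here so it runs without network access, but in practice you would fetch http://abc.com/robots.txt first:)

```python
from urllib.robotparser import RobotFileParser

# Example rules; in practice, download the site's real robots.txt.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a generic crawler may fetch a given URL.
print(parser.can_fetch("*", "http://abc.com/team/list.php"))  # True
print(parser.can_fetch("*", "http://abc.com/private/x.php"))  # False
```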

I came across Scrapy, but I don't know the exact URL where the name appears on a website. All I have is the root of the website, and I want the crawler to crawl through each linked page.

While typing this, I started wondering why not key the search query and website into Google and retrieve the results in an automated fashion - but I assume Google's ToS doesn't let you do that.

ThinkCode
  • http://www.google.com/search?q=%22John+Doe%22+site:abc.com#q=%22John+Doe%22+site:abc.com&hl=de&prmd=ivnso&filter=0&bav=on.2,or.r_gc.r_pw.&fp=479bfbd97c80bbdb&biw=1399&bih=928 you could try [google's api](http://code.google.com/intl/de-DE/apis/customsearch/v1/overview.html) – Jochen Ritzel Aug 17 '11 at 21:29
  • Going through Google APIs that let me do this and retrieve results (title, url, website snippet). – ThinkCode Aug 17 '11 at 21:33

1 Answer


Using a search engine, either by scraping it or by using their API (if you can follow their Terms of Use), is definitely the way to go here.

See, for example, how to do it with DuckDuckGo.
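As a rough illustration, the site-restricted query can be built for DuckDuckGo's Instant Answer API with nothing but the standard library. Note the caveats: the name and site below are the placeholders from the question, and this endpoint returns instant-answer data rather than a full list of web results, so check what actually comes back for your queries:

```python
from urllib.parse import urlencode

def build_search_url(name, site):
    """Builds a DuckDuckGo Instant Answer API request for a
    site-restricted search for a quoted name."""
    query = '"{}" site:{}'.format(name, site)
    return "https://api.duckduckgo.com/?" + urlencode(
        {"q": query, "format": "json", "no_html": 1}
    )

# Placeholder name and site from the question.
url = build_search_url("John Doe", "abc.com")
print(url)
```

Fetching that URL (e.g. with `urllib.request.urlopen`) returns JSON you can inspect for URLs mentioning the person.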

Gallaecio
  • 3,620
  • 2
  • 25
  • 64