
Okay, so I have been looking for the last two days for a crawler that suits my needs. I want to build a search engine and I want to do the indexing myself. This will be part of an academic project. Although I do not have the processing power to crawl the entire web, I would like to use a crawler that is actually capable of doing this. So what I am looking for is a crawler that:

  1. supports multithreading
  2. doesn't miss many links
  3. gives me a hook (a method I can override) to access the content of each crawled page, so that I can save it, parse it, etc. (see the sketch after this list)
  4. obeys robots.txt files
  5. crawls HTML pages (also PHP, JSP, etc.)
  6. recognizes pages with the same content and only returns one of them
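
To make requirement 3 concrete, this is roughly the hook I have in mind, sketched against crawler4j's WebCrawler class (which I tried, see below); the exact method signatures differ between crawler4j versions, so treat this as an illustration rather than working code:

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Only follow http(s) links; skip images, audio, video, PDFs, etc.
            String href = url.getURL().toLowerCase();
            return href.startsWith("http")
                    && !href.matches(".*\\.(jpg|png|gif|mp3|mp4|avi|pdf)$");
        }

        @Override
        public void visit(Page page) {
            // This is the hook I want: every fetched page is handed to me here,
            // so I can save or parse its content myself.
            if (page.getParseData() instanceof HtmlParseData) {
                HtmlParseData html = (HtmlParseData) page.getParseData();
                String text = html.getText(); // extracted plain text
                String raw = html.getHtml();  // raw HTML
                // ... store text/raw for my own indexing ...
            }
        }
    }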

What it doesn't (necessarily) have to do is:

  1. support PageRank
  2. index the results
  3. crawl images/audio/video/PDFs, etc.

I found a few libraries/projects that came very close to my needs, but as far as I know they don't support everything I need:

  1. First I came across crawler4j. The only problem with it is that it does not support a per-host politeness interval. Setting the politeness delay to a decent value of 1000 ms therefore makes the whole crawl terribly slow (see the sketch after this list).
  2. I also found flaxcrawler. It does support multithreading, but it appears to have problems finding and following links in web pages.
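
To show what I mean about crawler4j's politeness setting, this is roughly how it is wired up (class and method names taken from its documented API, but signatures vary by version; the storage path, seed URL and thread count below are just placeholders). The politeness delay is a single global setting on CrawlConfig rather than something that can be tuned per host, which is why 1000 ms slows the whole crawl down:

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/crawl");  // placeholder path

            // One delay for ALL hosts: at least 1000 ms between any two requests,
            // which is what makes the crawl so slow overall.
            config.setPolitenessDelay(1000);

            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();  // robots.txt is obeyed
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

            CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
            controller.addSeed("http://www.example.com/");  // placeholder seed
            controller.start(MyCrawler.class, 10);          // MyCrawler from the sketch above, 10 threads
        }
    }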

I also looked at more complete and complex crawlers such as Heritrix and Nutch. I am not that experienced with more complex tools, but I am definitely willing to use one if I can be sure that it does what I need: crawl the web and hand me every page so that I can read it.

Long story short: I am looking for a crawler that gets through pages on the web very fast and gives me the opportunity to do something with each of them.

Marco
  • What is the actual application requirement? – Peter Wooster Jan 30 '13 at 12:00
  • Sorry, can you clarify? I do not exactly understand what you mean by the actual application requirement. Do you refer to the number of links to be crawled? Or what the application should be capable of doing? Or something else? – Marco Jan 30 '13 at 12:06
  • What do you want to be able to achieve with this? – Peter Wooster Jan 30 '13 at 12:20
  • Well, for my Master's thesis I am developing a search engine based on user-generated content. I think this gives you an idea of the concept: http://en.wikipedia.org/wiki/Wikia_Search. Although the content of the search engine is mostly maintained by people, crawling has to be done as well. Is this a clear answer? – Marco Jan 30 '13 at 12:25
  • The fact that this is an academic project may help people give answers. You should include that in your question. – Peter Wooster Jan 30 '13 at 15:32

1 Answer


AFAIK, Apache Nutch suits most of your requirements. Nutch also has a plugin architecture, which is helpful if you need to write your own components. You can go through the wiki [0] and ask on the mailing list if you have any questions.

[0] http://wiki.apache.org/nutch/FrontPage
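
For example, if you need direct access to the content of every page Nutch parses (your requirement 3), a parse filter plugin is one way to hook in. Below is a rough sketch against the Nutch 1.x HtmlParseFilter extension point; the exact interface depends on your Nutch version, and the plugin.xml wiring that registers the class is not shown:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseResult;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MyParseFilter implements HtmlParseFilter {

        private Configuration conf;

        // Called for every HTML page Nutch parses; both the raw content and the
        // extracted text are reachable from here.
        public ParseResult filter(Content content, ParseResult parseResult,
                                  HTMLMetaTags metaTags, DocumentFragment doc) {
            String url = content.getUrl();
            Parse parse = parseResult.get(url);
            String text = parse.getText();
            // ... save or post-process the page content as needed ...
            return parseResult;
        }

        public void setConf(Configuration conf) { this.conf = conf; }

        public Configuration getConf() { return conf; }
    }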

kich