Okay, so I have been looking for the last two days for a crawler that suits my needs. I want to build a search engine and I want to do the indexing myself. This will be part of an academic project. Although I do not have the processing power to crawl the entire web, I would like to use a crawler that is actually capable of doing this. So what I am looking for is a crawler that:
- supports multithreading
- doesn't miss many links
- gives me a way (e.g. a method I can override) to access the content of every crawled page so that I can save it, parse it, etc. (see the sketch after these lists)
- obeys robots.txt files
- crawls HTML pages (including dynamically generated ones such as PHP, JSP, etc.)
- recognizes pages with the same content and returns only one of them.
What it doesn't (necessarily) have to do is:
- support page ranking.
- index results.
- crawl images/audio/video/pdf etc.
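To make the content-access requirement concrete, here is roughly the kind of hook I have in mind, sketched against crawler4j's API as I understand it (exact signatures may differ between versions; in older releases `shouldVisit` took only a `WebURL`, and the URL filter below is just an example):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Skip binary content (images/audio/video/pdf), since I don't need it.
        String href = url.getURL().toLowerCase();
        return !href.matches(".*\\.(png|jpe?g|gif|mp3|mp4|avi|pdf)$");
    }

    @Override
    public void visit(Page page) {
        // This is the hook I need: access to the content of every crawled page.
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            String text = html.getText();   // plain text for my own indexing
            String markup = html.getHtml(); // raw HTML if I want to parse it myself
            // ... save text/markup to disk or hand it to my indexer here ...
        }
    }
}
```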
I found a few libraries/projects that came very close to my needs, but as far as I know they don't support everything I need:
- First I came across crawler4j. The only problem with this one is that it doesn't support a politeness interval per host: the delay applies globally, so setting it to a decent value of 1000 ms makes the crawler terribly slow (see the configuration sketch after this list).
- I also found flaxcrawler. It does support multithreading, but it appears to have problems finding and following links in web pages.
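For reference, this is the kind of crawler4j setup I mean (API as I understand it from its examples; the storage folder, seed URL, and thread count are just placeholders). As far as I can tell, `setPolitenessDelay` sets a single global delay between fetches, not a per-host one, which is exactly my problem:

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data"); // intermediate crawl data (example path)
        // This delay applies to *every* request, regardless of host,
        // so 1000 ms means at most one fetch per second in total.
        config.setPolitenessDelay(1000);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://example.com/"); // placeholder seed URL
        int numberOfCrawlers = 8;                  // number of crawler threads (multithreading)
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
```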
I also looked at more complete and complex crawlers such as Heritrix and Nutch. I am not that comfortable with complex tools, but I am definitely willing to use one if I can be sure it will do what I need: crawl the web and hand me every page so that I can read it.
Long story short: I am looking for a crawler that gets through as many pages of the web as fast as possible and lets me do something with each of them.