
I have a list of URLs and I want to download them in order to build an index in webtrec format. I've found a useful framework, MapReduce (Apache Hadoop), but I'd like to know if there is a Java implementation of what I want to do, or perhaps a close example of it.

Thank you!

synack
  • possible duplicate of [Simple Java Map/Reduce framework](http://stackoverflow.com/questions/5260212/simple-java-map-reduce-framework) – ant May 09 '12 at 12:35
  • You might want to look into Nutch - http://nutch.apache.org/ – Chris White May 10 '12 at 02:57

1 Answer


The MapReduce pattern is designed for parallelizable, CPU-bound computations carried out in multiple steps. Downloading and crawling web pages, on the other hand, is an I/O-bound operation. Hence, you should treat the two as separate concerns.

So, if performance really matters, you should first download the pages using something like a work queue combined with asynchronous I/O. In a second step, you can then use MapReduce to build the actual index.
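For illustration, here is a minimal sketch of that first step using the JDK's `HttpClient` with asynchronous requests. The input file `urls.txt` and the `store` method are placeholders for whatever source and storage you actually use, not part of any framework:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncDownloader {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: one URL per line in urls.txt
        List<String> urls = Files.readAllLines(Path.of("urls.txt"));

        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // Issue all requests asynchronously; the JDK manages the I/O for us.
        List<CompletableFuture<Void>> downloads = urls.stream()
                .map(url -> client.sendAsync(
                            HttpRequest.newBuilder(URI.create(url)).GET().build(),
                            HttpResponse.BodyHandlers.ofString())
                        .thenAccept(resp -> store(url, resp.body())))
                .toList();

        // Wait until every download has completed (or failed).
        CompletableFuture.allOf(downloads.toArray(new CompletableFuture[0])).join();
    }

    // Placeholder: persist the page body for the later indexing step.
    private static void store(String url, String body) {
        System.out.printf("fetched %s (%d chars)%n", url, body.length());
    }
}
```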

Hadoop is one possibility, but if you're not targeting large scale, frameworks such as Fork/Join and Akka may be applicable as well.
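As a rough sketch of the second step without Hadoop, the Fork/Join framework can split the indexing work recursively and merge the partial results in map/reduce style. The naive tokenization and the in-memory index below are simplifications, not what you would ship for a real TREC index:

```java
import java.util.*;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Builds an inverted index (term -> document ids) from already downloaded pages,
// splitting the work recursively and merging the partial indexes.
public class IndexTask extends RecursiveTask<Map<String, Set<Integer>>> {
    private static final int THRESHOLD = 16;
    private final List<String> docs;   // page texts; position in the list = document id
    private final int from, to;

    IndexTask(List<String> docs, int from, int to) {
        this.docs = docs;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Map<String, Set<Integer>> compute() {
        if (to - from <= THRESHOLD) {          // "map": index a small chunk directly
            Map<String, Set<Integer>> index = new HashMap<>();
            for (int id = from; id < to; id++) {
                for (String term : docs.get(id).toLowerCase().split("\\W+")) {
                    if (!term.isEmpty()) {
                        index.computeIfAbsent(term, t -> new HashSet<>()).add(id);
                    }
                }
            }
            return index;
        }
        int mid = (from + to) / 2;             // split and process both halves in parallel
        IndexTask left = new IndexTask(docs, from, mid);
        IndexTask right = new IndexTask(docs, mid, to);
        left.fork();
        Map<String, Set<Integer>> rightIndex = right.compute();
        Map<String, Set<Integer>> leftIndex = left.join();
        // "reduce": merge the partial indexes
        rightIndex.forEach((term, ids) ->
                leftIndex.merge(term, ids, (a, b) -> { a.addAll(b); return a; }));
        return leftIndex;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("hadoop map reduce", "java fork join", "map reduce in java");
        Map<String, Set<Integer>> index =
                ForkJoinPool.commonPool().invoke(new IndexTask(docs, 0, docs.size()));
        System.out.println(index);
    }
}
```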

b_erb