
What do I need to do to have smaller/larger blocks in Hadoop?

Concretely, I want a larger number of mappers, each of which gets a smaller piece of data to work on. It seems that I need to decrease the block size, but I'm confused (I'm new to Hadoop): do I need to do something while putting the file on HDFS, do I need to specify something related to the input split size, or both?

I'm sharing the cluster, so I cannot change global settings; I need this on a per-job basis, if possible. And I'm running the job from code (later from Oozie, possibly).

Kobe-Wan Kenobi

2 Answers


What a mapper processes is controlled by the input split, and it is completely up to you how you specify it. The HDFS block size has nothing to do with it (other than the fact that most splitters use the block size as the basic unit for creating input splits, in order to achieve good data locality). You can write your own splitter that takes an HDFS block and splits it into 100 splits, if you so fancy (see the sketch below). Also look at Change File Split size in Hadoop.
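
For illustration, here is a minimal sketch of such a custom splitter (the class name is hypothetical and the new MapReduce API is assumed); it shrinks each split to a quarter of an HDFS block, so roughly four mappers run per block without touching dfs.blocksize:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: chop each HDFS block into ~4 input splits.
public class FinerTextInputFormat extends TextInputFormat {
    @Override
    protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // FileInputFormat's default is Math.max(minSize, Math.min(maxSize, blockSize));
        // here the per-split size is reduced to a quarter of a block.
        return Math.max(minSize, Math.min(maxSize, blockSize / 4));
    }
}
```

Register it in the driver with `job.setInputFormatClass(FinerTextInputFormat.class)` and the framework will compute the finer splits for you.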

That being said, the wisdom of doing that ('many mappers with small splits') is highly questionable. Everybody else is trying to do the opposite (create fewer mappers with aggregated splits). See Dealing with Hadoop's small files problem, The Small Files Problem, Amazon Elastic MapReduce Deep Dive and Best Practices, and so on.

Remus Rusanu
  • Thank you very much for your answer. But won't I create a problem for myself if the split size is, for example, 16MB and the block size 64MB? I will lose data locality, and it will probably happen that mappers need to get their part of the data from another node on which that quarter of the block is located. Isn't it better to use a smaller block size as well, to "scale down Hadoop"? – Kobe-Wan Kenobi May 14 '15 at 09:02
  • The thing is, my data is not that big currently (and will probably grow to a couple of GB), while the operations performed (mostly in the mappers) are quite intensive. I'm talking about machine learning algorithms (concretely clustering) that do some intense processing on a per-record basis. This way my data is processed by a small number of nodes (since the data is not that big), which on the other hand are choking on their computations. Do you have any better advice on what to do? – Kobe-Wan Kenobi May 14 '15 at 09:04
  • If the data is small but the computation intensive, then losing locality should have minimal impact. Submitting a job with `-Dmapreduce.input.fileinputformat.split.maxsize=...` is a trivial test to perform (see the driver sketch below). Reformatting your FS to change the block size is rather expensive and has consequences (the namenode has to track 4x as many blocks, for instance). – Remus Rusanu May 14 '15 at 09:14
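
Note that the `-D` options mentioned in the comment above are only picked up per job if the driver goes through ToolRunner (or GenericOptionsParser). A minimal driver sketch, with a hypothetical class name and assuming the new MapReduce API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner,
        // e.g. -Dmapreduce.input.fileinputformat.split.maxsize=16777216
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes as usual ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}
```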

You don't really have to decrease the block size to get more mappers that each process a smaller amount of data.

You don't have to modify the HDFS block size (dfs.blocksize); leave it at the default global value of your cluster configuration.

You may use the mapreduce.input.fileinputformat.split.maxsize property in your job configuration with a value lower than the block size.

The input splits will be calculated with this value, and one mapper will be triggered for every input split calculated.
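
A minimal sketch of doing this per job from code (driver skeleton only, class and job names hypothetical); the property caps each split at 16 MB so that more mappers are launched, each on less data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the split size at 16 MB; same effect as passing
        // -Dmapreduce.input.fileinputformat.split.maxsize=16777216 on submit.
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 16L * 1024 * 1024);

        Job job = Job.getInstance(conf, "small-split-job");
        // ... set mapper/reducer classes and input/output paths as usual ...

        // Equivalently, FileInputFormat has a helper that sets the same property:
        FileInputFormat.setMaxInputSplitSize(job, 16L * 1024 * 1024);
    }
}
```

Since this is plain job configuration, the same property can also be set in an Oozie action's configuration block.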

suresiva
  • Same question as for Remus: won't I create a problem for myself if the split size is, for example, 16MB and the block size 64MB? I will lose data locality, and it will probably happen that mappers need to get their part of the data from another node on which that quarter of the block is located. Isn't it better to use a smaller block size as well, to "scale down Hadoop"? – Kobe-Wan Kenobi May 14 '15 at 09:05
  • If your mapper performs highly intensive calculations per record, which slows mapper progress, and you still have spare resources available, then it is really fine to lower the input split size. One thing you should consider is the balance between the available resources and the mappers' execution. You should find an appropriately low value for the input split size so that the parallel mappers can be executed well by the cluster. If there are too many mappers, beyond the cluster's capacity, they mostly run in sequence, which would also cause slowness. – suresiva May 14 '15 at 09:56
  • Thank you for your answer, both yours and Remus' are very good, but he provided a bit more info through external links, so I'm accepting his answer and giving you +1. – Kobe-Wan Kenobi May 14 '15 at 11:22