
I need to know the input data size of each task. Which class in Hadoop can help me? Is FileInputFormat.java helpful? How do I use it, and what inputs does it need?

mndn
  • Have a look at: https://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries/34737075#34737075 – Ravindra babu May 30 '17 at 19:05
  • How can I create an object of org.apache.hadoop.mapreduce.lib.input.FileInputFormat so I can use its getSplits() method? – mndn May 30 '17 at 19:46

1 Answer


The input size of the whole job is simply the total size of the input files on HDFS.
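A minimal sketch of how one might check that, assuming a placeholder input path of /user/hadoop/input; it just sums the file lengths under the path with the standard FileSystem API:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TotalInputSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder path for this example; replace with your job's input directory.
        Path inputDir = new Path("/user/hadoop/input");

        // getContentSummary() sums the lengths of all files under the path,
        // which is the total input size of the job.
        long totalBytes = fs.getContentSummary(inputDir).getLength();
        System.out.println("Total job input size: " + totalBytes + " bytes");
    }
}
```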

The input size of each mapper task is determined by the split size, which can be tuned with the following property (64 MB matches the default HDFS block size of older Hadoop releases):

mapreduce.input.fileinputformat.split.minsize=64Mb

Hadoop splits the input into pieces of the split size, which is computed as:

max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))

Use these properties to work out the input size of each task; the sketch below shows how to read the actual split sizes programmatically.
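Since the question asked specifically about FileInputFormat and getSplits(), here is a rough sketch (my own, not part of the original answer) of how to instantiate a concrete subclass such as TextInputFormat and list the splits from the driver; each InputSplit corresponds to one map task, and the input path is a placeholder:

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitSizes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-example");

        // Placeholder input directory; getSplits() reads it from the job configuration.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));

        // FileInputFormat is abstract, so instantiate a concrete subclass.
        TextInputFormat inputFormat = new TextInputFormat();

        // One InputSplit per map task; getLength() is that task's input size in bytes.
        List<InputSplit> splits = inputFormat.getSplits(job);
        for (InputSplit split : splits) {
            System.out.println(split + " -> " + split.getLength() + " bytes");
        }
    }
}
```

Inside a running mapper the same number is available as context.getInputSplit().getLength(), and the min/max bounds used in the formula above can be set from the driver with FileInputFormat.setMinInputSplitSize(job, ...) and FileInputFormat.setMaxInputSplitSize(job, ...).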

Viacheslav Shalamov
  • Hadoop has logical splits and physical splits. The physical split belongs to HDFS, but the logical split belongs to a task; one task may need to process more than one physical block. – mndn May 30 '17 at 10:32
  • I need the input data size for each task. – mndn May 30 '17 at 10:32