We have dfs.blocksize set to 512 MB for one of our MapReduce jobs, which is a map-only job. However, some of the mappers are writing output files larger than 512 MB, for example 512.9 MB.

I believe the mapper block size should be restrained by dfs.blocksize. I would appreciate any input. Thanks.

Kans
  • File size != block size – OneCricketeer Apr 28 '18 at 00:18
  • Stack Overflow is a site for programming and development questions. This question appears to be off-topic because it is not about programming or development. See [What topics can I ask about here](http://stackoverflow.com/help/on-topic) in the Help Center. Perhaps [Super User](http://superuser.com/) or [Unix & Linux Stack Exchange](http://unix.stackexchange.com/) would be a better place to ask. – jww May 02 '18 at 04:40

2 Answers

> I believe, the mapper block size should be restrained by the dfs.blocksize.

This is not true. Files can be larger than block size. They'll just span multiple blocks in that case.
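
As a rough illustration of the point above, here is a minimal Python sketch of how a 512.9 MB file would be divided when the block size is 512 MB. The helper `blocks_for_file` is hypothetical (it is not part of any Hadoop API), and the sketch only does the arithmetic: it ignores replication and assumes the last block holds just the leftover bytes.

```python
MB = 1024 * 1024


def blocks_for_file(file_size_bytes, block_size_bytes):
    """Return the sizes of the blocks a file of this size would span.

    Hypothetical helper for illustration only: every block is full
    except possibly the last, which holds the remaining bytes.
    """
    if file_size_bytes <= 0:
        return []
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


block_size = 512 * MB            # dfs.blocksize from the question
file_size = int(512.9 * MB)      # one mapper's output file

sizes = blocks_for_file(file_size, block_size)
print(len(sizes))                           # 2
print([round(s / MB, 1) for s in sizes])    # [512.0, 0.9]
```

Under these assumptions the 512.9 MB output is one file made of two blocks: one full 512 MB block and one small block of about 0.9 MB. The block size limits each block, not the file.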

Hari Menon

Mappers do not save their outputs in HDFS. They use the regular local file system for saving results, so that temporary data is not replicated across the servers in the HDFS cluster. So the HDFS block size has nothing to do with the mappers' output file size.

alex-arkhipov
  • Mappers can save to HDFS. Have you not seen a part-m-00000 file? – OneCricketeer Apr 28 '18 at 15:09
  • @cricket_007 - you can configure a mapper to save to HDFS, and its output will also be saved there if the job runs no reducers. In most cases, however, it is not. See [here](http://mlwiki.org/index.php/Hadoop_MapReduce) and [here](http://data-flair.training/forums/topic/in-map-reduce-why-map-write-output-to-local-disk-instead-of-hdfs) – alex-arkhipov Apr 28 '18 at 22:45
  • Sure. I'm just clarifying your statement "do not save", when, in fact, they can. – OneCricketeer Apr 29 '18 at 00:12
  • Well, keep in mind that MapReduce jobs can be map-only jobs, which do not involve any shuffle. In that case, the mappers write the final output directly to HDFS (see here https://stackoverflow.com/questions/42621466/will-there-be-shuffle-and-sort-in-map-only-task/42621889#42621889) – dbustosp Apr 29 '18 at 00:13