In our cluster dfs.block.size is set to 128 MB, but I have seen quite a few files of about 68.8 MB, which seems like an odd size. I am confused about how exactly this configuration option affects what files look like on HDFS.
- First, I want to confirm: ideally, will every file be exactly the configured block size? By "ideally" I mean a one-to-one mapping between files and blocks.
- If the files are not inherently small but are generated by MR jobs, what could cause these small files?
- One more point: we are using Hive's dynamic partitioning, and I am not sure whether it is one source of the problem. To understand where small files come from, I have read this blog post: The Small Files Problem.
But the situations described there don't really match mine, so I am still confused. I hope someone can give me some insight on this. Thanks a lot in advance.
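To make the first question concrete, here is a minimal Python sketch of the file-to-block mapping I have in mind. It assumes a file is cut into blocks of at most the configured block size and that the last block only holds the remaining bytes; I am not certain this is what actually happens in our case. `blocks_for` is a made-up helper for illustration, not a Hadoop API.

```python
MB = 1024 * 1024


def blocks_for(file_size_bytes, block_size_bytes=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would be split into.

    Assumption: every block is full except possibly the last one,
    which holds only the leftover bytes.
    """
    if file_size_bytes == 0:
        return []
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


# A 68.8 MB file is smaller than one 128 MB block, so under this
# assumption it would occupy a single, partially filled block.
small_file = blocks_for(int(68.8 * MB))

# A 300 MB file would span two full blocks plus one 44 MB block.
large_file = blocks_for(300 * MB)
```

If this picture is right, the block size is only an upper bound per block, and a 68.8 MB file would say more about how much data each task wrote than about dfs.block.size itself.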