
My Hive table has multiple small ORC files. The size of each file is less than the HDFS block size. This is a big waste. I use the following Spark code to merge the small files, but the total size of the merged files is nearly 3 times larger than that of the original small files.

    JavaSparkContext sc = startContext("test");
    HiveContext hc = new HiveContext(sc);

    DataFrame df = hc.read().format("orc").load(inputPath);
    logger.info("source schema:");
    logger.info(df.schema().treeString());

    DataFrame df2 = df.repartition(partitionNum);
    logger.info("target schema:");
    logger.info(df2.schema().treeString());
    df2.write().mode("append").orc(outputPath);

    closeContext(sc);
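
For reference, a minimal variant of the merge (assuming the same `hc`, `inputPath`, `outputPath` and `partitionNum` as above): `coalesce()` avoids the full shuffle that `repartition()` triggers when the goal is only to reduce the number of files, and the write pins the ORC compression codec explicitly so it can be compared with the source files. The `compression` option is what the Spark 2.x ORC data source understands (none/snappy/zlib/lzo); whether it is honored by the older DataFrame API shown here is an assumption.

    // Sketch only: merge the small files with coalesce() instead of repartition()
    // and set the output codec explicitly.
    DataFrame merged = hc.read().format("orc").load(inputPath)
            .coalesce(partitionNum);          // no full shuffle, just fewer output files
    merged.write()
            .mode("append")
            .option("compression", "zlib")    // assumption: Spark 2.x ORC option; use whatever codec the source files use
            .orc(outputPath);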

Has anyone run into the same problem? Thanks

zjiash
  • Check for **compression properties** from Spark end. – mrsrinivas Feb 21 '17 at 07:46
  • The original ORC files were generated in the same way, so I don't think it is a compression issue. Thanks. – zjiash Feb 21 '17 at 08:08
  • _"each file is less than the HDFS block size ... This is a big waste"_ - why "big"? The HDFS block size is just a MAX size; when a file reaches the block size then the DataNode-in-charge commits the block, and creates a new one (possibly using different peer DataNodes for replication). But at the end of the day a 10 KB file just takes 3x10 KB on the Linux filesystems. *(well, more likely 3x12 KB because of alignment but that's not a Big Data issue)* – Samson Scharfrichter Feb 21 '17 at 22:36
  • BTW, Hive has a *CONCATENATE* command for merging ORC files. It does not rebuild the inner stripes, just stacks them together in a single file (a short sketch follows these comments) – see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate – Samson Scharfrichter Feb 21 '17 at 22:39
  • Thanks. Using `CONCATENATE` is enough for me. – zjiash Feb 22 '17 at 03:22
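
A minimal sketch of the `CONCATENATE` route mentioned above (the table and partition names below are hypothetical; run it from the Hive CLI or beeline, since it merges the table's files in place by stacking the existing stripes rather than rewriting them):

    -- hypothetical table name: merge the table's small ORC files in place
    ALTER TABLE my_orc_table CONCATENATE;

    -- for a partitioned table, concatenate one partition at a time
    ALTER TABLE my_orc_table PARTITION (dt = '2017-02-21') CONCATENATE;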

0 Answers