
My Hive table has multiple small ORC files. The size of each file is less than the HDFS block size. This is a big waste. I use the following Spark code to merge the small files, but the total size of the merged files is nearly 3 times larger than that of the original small files.

    JavaSparkContext sc = startContext("test");
    HiveContext hc = new HiveContext(sc);

    DataFrame df = hc.read().format("orc").load(inputPath);
    logger.info("source schema:");
    logger.info(df.schema().treeString());

    DataFrame df2 = df.repartition(partitionNum);
    logger.info("target schema:");
    logger.info(df2.schema().treeString());
    df2.write().mode("append").orc(outputPath);

    closeContext(sc);
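
For reference, a minimal variant of the merge (assuming the same `hc`, `inputPath`, `outputPath` and `partitionNum` as above): `coalesce()` avoids the full shuffle that `repartition()` triggers when the goal is only to reduce the number of files, and the write pins the ORC compression codec explicitly so it can be compared with the source files. The `compression` option is what the Spark 2.x ORC data source understands (none/snappy/zlib/lzo); whether it is honored by the older DataFrame API shown here is an assumption.

    // Sketch only: merge the small files with coalesce() instead of repartition()
    // and set the output codec explicitly.
    DataFrame merged = hc.read().format("orc").load(inputPath)
            .coalesce(partitionNum);          // no full shuffle, just fewer output files
    merged.write()
            .mode("append")
            .option("compression", "zlib")    // assumption: Spark 2.x ORC option; use whatever codec the source files use
            .orc(outputPath);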

Has anyone run into the same problem? Thanks

zjiash
  • Check for **compression properties** from Spark end. – mrsrinivas Feb 21 '17 at 07:46
  • The original ORC files were generated in the same way, so I don't think it is a compression issue. Thanks. – zjiash Feb 21 '17 at 08:08
  • _"each file is less than the HDFS block size ... This is a big waste"_ - why "big"? The HDFS block size is just a MAX size; when a file reaches the block size then the DataNode-in-charge commits the block, and creates a new one (possibly using different peer DataNodes for replication). But at the end of the day a 10 KB file just takes 3x10 KB on the Linux filesystems. *(well, more likely 3x12 KB because of alignment but that's not a Big Data issue)* – Samson Scharfrichter Feb 21 '17 at 22:36
  • BTW, Hive has a *CONCATENATE* command for merging ORC files. It does not rebuild the inner stripes, just stacks them together in a single file (a short sketch follows these comments) – see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable/PartitionConcatenate – Samson Scharfrichter Feb 21 '17 at 22:39
  • Thanks. Using `CONCATENATE` is enough for me. – zjiash Feb 22 '17 at 03:22
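
A minimal sketch of the `CONCATENATE` route mentioned above (the table and partition names below are hypothetical; run it from the Hive CLI or beeline, since it merges the table's files in place by stacking the existing stripes rather than rewriting them):

    -- hypothetical table name: merge the table's small ORC files in place
    ALTER TABLE my_orc_table CONCATENATE;

    -- for a partitioned table, concatenate one partition at a time
    ALTER TABLE my_orc_table PARTITION (dt = '2017-02-21') CONCATENATE;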

0 Answers