My Hive table consists of many small ORC files, each smaller than the HDFS block size, which is wasteful. I use the following Spark code to merge the small files, but the total size of the merged files is nearly 3 times larger than that of the original small files.
JavaSparkContext sc = startContext("test");
HiveContext hc = new HiveContext(sc);

// Read all the small ORC files under inputPath into one DataFrame
DataFrame df = hc.read().format("orc").load(inputPath);
logger.info("source schema:");
logger.info(df.schema().treeString());

// Repartition so the data is written back out as partitionNum larger files
DataFrame df2 = df.repartition(partitionNum);
logger.info("target schema:");
logger.info(df2.schema().treeString());

// Write the merged data back as ORC
df2.write().mode("append").orc(outputPath);
closeContext(sc);
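One thing I am not sure about: the original files were written by Hive (which compresses ORC with ZLIB by default), so maybe the Spark write uses a different codec, or the shuffle done by repartition() destroys the row ordering that made ORC's run-length/dictionary encodings compact. Below is a minimal variant I am considering, assuming coalesce() avoids the shuffle and assuming the ORC writer accepts a "compression" option (I have only seen that option documented for Spark 2.x, so it may not apply to my version):

// Merge partitions without a full shuffle, so rows keep their original order
DataFrame merged = df.coalesce(partitionNum);
// Try to force the same codec Hive used (ZLIB is the Hive ORC default);
// the "compression" option is an assumption on my part for this Spark version
merged.write()
      .mode("append")
      .option("compression", "zlib")
      .orc(outputPath);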
Has anyone run into the same problem? Thanks