
I would like to merge the output into files of about 128 MB each in Hive. In Spark I have set the following properties, but they don't seem to take effect. Can someone give me a suggestion?

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("MyExample")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sqlContext.setConf("hive.mapred.supports.subdirectories", "true")
    spark.sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    spark.sqlContext.setConf("hive.hadoop.supports.splittable.combineinputformat", "true")
    spark.sqlContext.setConf("hive.exec.compress.output", "false")
    spark.sqlContext.setConf("hive.input.format", "org.apache.hadoop.hive.ql.io.CombineHiveInputFormat")
    spark.sqlContext.setConf("hive.merge.mapfiles", "true")
    spark.sqlContext.setConf("hive.merge.mapredfiles", "true")
    spark.sqlContext.setConf("hive.merge.size.per.task", "128000000")
    spark.sqlContext.setConf("hive.merge.smallfiles.avgsize", "128000000")
    spark.sqlContext.setConf("hive.groupby.skewindata", "true")
    spark.sqlContext.setConf("hive.merge.sparkfiles", "true")

    val df = spark.read.format("csv")
      .option("header", "false")
      .load(path)

    df.write.format("csv").saveAsTable("test_table")
avseq

1 Answer

You can either estimate or calculate the size of the DataFrame, as described in this post: How to find spark RDD/Dataframe size?
Then repartition accordingly:

    val nPartitions = math.ceil(sizeInMB / 128.0).toInt
    df.repartition(nPartitions).write.format(....).saveAsTable(...)
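Putting the two steps together, here is a minimal sketch of the idea. It assumes a recent Spark version (2.3+), where the optimizer's plan statistics expose an estimated size in bytes; since this is only an estimate, the resulting files will be roughly, not exactly, 128 MB. The helper name writeWith128MbFiles is made up for illustration.

    import org.apache.spark.sql.{DataFrame, SaveMode}

    // Hypothetical helper: derive a partition count from the optimizer's size
    // estimate so each output file lands near 128 MB, then write the table.
    def writeWith128MbFiles(df: DataFrame, table: String): Unit = {
      // Estimated size in bytes from the optimized logical plan (Spark 2.3+).
      val sizeInBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes
      val sizeInMB    = sizeInBytes.toDouble / (1024 * 1024)
      val nPartitions = math.max(1, math.ceil(sizeInMB / 128.0).toInt)

      df.repartition(nPartitions)
        .write
        .mode(SaveMode.Overwrite)
        .format("csv")
        .saveAsTable(table)
    }

    // Usage:
    // writeWith128MbFiles(df, "test_table")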
Paul
  • Hi Paul, thanks for the information, it's helpful. But I would still like to know why the configuration doesn't work in Hive on Spark. – avseq Sep 08 '19 at 15:12
  • I'm not entirely sure, but it seems that Spark does not support this Hive option. In this ticket, https://issues.apache.org/jira/browse/SPARK-16188, they say the recommended way in Spark is to use coalesce or repartition (a coalesce variant is sketched after these comments). Not sure if this has changed since then. – Paul Sep 08 '19 at 16:28
  • Paul, thanks for your information, it's helpful. I will use repartition or coalesce to implement it. – avseq Sep 10 '19 at 01:46
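Following up on the coalesce suggestion in the comments: coalesce only merges existing partitions without a full shuffle, so it is cheaper than repartition, but it can only reduce the partition count. A minimal sketch, again assuming sizeInMB has already been estimated as above; the table name test_table_coalesced is made up for illustration.

    // coalesce avoids a full shuffle but can only decrease the number of
    // partitions, so fall back to the DataFrame as-is when it already has
    // fewer partitions than the target.
    val targetPartitions = math.max(1, math.ceil(sizeInMB / 128.0).toInt)
    val merged =
      if (df.rdd.getNumPartitions > targetPartitions) df.coalesce(targetPartitions)
      else df
    merged.write.format("csv").saveAsTable("test_table_coalesced")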