
This is a follow-up to Save Spark dataframe as dynamic partitioned table in Hive. I tried the suggestions in the answers there but couldn't make them work in Spark 1.6.1.

I am trying to create partitions programmatically from a `DataFrame`. Here is the relevant code (adapted from a Spark test):

hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
//    hc.setConf("hive.exec.dynamic.partition", "true")
//    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")

Seq(2012 -> "a").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show

Full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a

Partitioned files are created fine on the file system but Hive complains that the table is not partitioned:

======================
HIVE FAILURE OUTPUT
======================
SET hive.support.sql11.reserved.keywords=false
SET hive.metastore.warehouse.dir=tmp/tests
OK
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a partitioned table
======================

It looks like the root cause is that `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable` always creates the table with an empty partition-column list.

Any help to move this forward is appreciated.

EDIT: also created SPARK-14927


1 Answer

I found a workaround: if you pre-create the table, then `saveAsTable()` won't mess with it. So the following works:

hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
//    hc.setConf("hive.exec.dynamic.partition", "true")
//    hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create database if not exists tmp")
hc.sql("drop table if exists tmp.partitiontest1")

// Added line:
hc.sql("create table tmp.partitiontest1(val string) partitioned by (year int)")
Seq(2012 -> "a").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")
hc.sql("show partitions tmp.partitiontest1").show

This workaround works in 1.6.1 but not in 1.5.1.
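Another route that may help on older versions (untested here, since it needs a live Hive metastore; `hc` is the `HiveContext` from the question) is to skip `saveAsTable()` entirely and write into the pre-created table with `insertInto()` after enabling Hive dynamic partitioning. Note that `insertInto()` resolves columns by position, not by name, so the partition column must come last in the DataFrame:

```scala
import org.apache.spark.sql.SaveMode

// Dynamic-partition settings are required when the partition value
// is taken from the data rather than specified in the INSERT statement
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Pre-create the partitioned table, as in the workaround above
hc.sql("create table if not exists tmp.partitiontest1(val string) partitioned by (year int)")

// insertInto matches columns by position: data columns first, partition column last,
// hence the select("val", "year") reordering
Seq(2012 -> "a").toDF("year", "val")
  .select("val", "year")
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.partitiontest1")
```

Because `insertInto()` goes through Hive's own INSERT path rather than Spark's metastore table creation, the partition metadata is registered and `show partitions` should see it.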
