
This is something of an edge case when saving a partitioned parquet table in Spark SQL:

// schema definition
final StructType schema = DataTypes.createStructType(Arrays.asList(
    DataTypes.createStructField("time", DataTypes.StringType, true),
    DataTypes.createStructField("accountId", DataTypes.StringType, true),
    ...

DataFrame df = hiveContext.read().schema(schema).json(stringJavaRDD);

df.coalesce(1)
    .write()
    .mode(SaveMode.Append)
    .format("parquet")
    .partitionBy("year")
    .saveAsTable("tblclick8partitioned");

Spark warns:

Persisting partitioned data source relation into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive

In Hive:

hive> describe tblclick8partitioned;
OK
col                     array<string>           from deserializer
Time taken: 0.04 seconds, Fetched: 1 row(s)

Obviously the schema is not correct; however, if I use saveAsTable in Spark SQL without partitioning, the table can be queried without problem.

The question is: how can I make a parquet table in Spark SQL compatible with Hive, including the partition info?

dunlu_98k

2 Answers


That's because DataFrame.saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling DataFrame.saveAsTable. An example from SPARK-14927 looks like this:

hc.sql("create external table tmp.partitiontest1(val string) partitioned by (year int)")

Seq(2012 -> "a", 2013 -> "b", 2014 -> "c").toDF("year", "val")
  .write
  .partitionBy("year")
  .mode(SaveMode.Append)
  .saveAsTable("tmp.partitiontest1")
rys
    Hi rys, welcome to StackOverflow and thank you for your answer. As links can change over time, would you be able to edit your answer to quote the relevant portions of the solution here? Thanks! – Tim Malone Sep 19 '16 at 23:14

A solution is to create the table with Hive and then save the data with ...partitionBy("year").insertInto("default.mytable").

In my experience, creating the table in Hive and then using ...partitionBy("year").saveAsTable("default.mytable") did not work. This is with Spark 1.6.2.
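
For illustration, here is a minimal sketch of the insertInto approach (the table and column names are hypothetical, and hc is a HiveContext as in the other answer; per this answer, it targets Spark 1.6.x):

import org.apache.spark.sql.SaveMode
import hc.implicits._

// Create the partitioned table in Hive first, so the metastore holds a
// Hive-compatible definition.
hc.sql("create table default.mytable (val string) partitioned by (year int) stored as parquet")

// Hypothetical DataFrame; insertInto resolves columns by position, so
// the partition column goes last to line up with the table definition.
val df = Seq(("a", 2012), ("b", 2013)).toDF("val", "year")

df.write
  .mode(SaveMode.Append)
  .partitionBy("year")
  .insertInto("default.mytable")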

foghorn
    I think this won't work with Spark 2.0.0; I get ```User class threw exception: org.apache.spark.sql.AnalysisException: insertInto() can't be used together with partitionBy(). Partition columns have already be defined for the table. It is not necessary to use partitionBy().;``` – WBC Feb 13 '18 at 22:10
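
On Spark 2.x, the error message above suggests the fix itself: drop the partitionBy() call and rely on the partitioning already declared on the Hive table. A hedged sketch, not verified against 2.0.0:

// The table already declares its partition column, so partitionBy()
// is redundant with insertInto() on Spark 2.x.
df.write
  .mode(SaveMode.Append)
  .insertInto("default.mytable")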