
I am using Spark version 2.3 to write and save dataframes using bucketBy.

The table gets created in Hive but not with the correct schema. I am not able to select any data from the Hive table.

# write the DataFrame as a bucketed, sorted Hive table
(DF.write
   .format('orc')
   .bucketBy(20, 'col1')
   .sortBy("col2")
   .mode("overwrite")
   .saveAsTable('EMP.bucketed_table1'))

I am getting the below message:

Persisting bucketed data source table emp.bucketed_table1 into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

The Hive schema is being created as shown below:

hive> desc EMP.bucketed_table1;
OK
col                     array<string>           from deserializer

How can I save and write dataframes to a Hive table so that the data can be queried later?

Victor
vikrant rana
  • Try to create a bucketed table in Hive, then insert into it from Spark instead of `saveAsTable` – serge_k Sep 05 '18 at 10:59
  • Please link this question: https://stackoverflow.com/questions/52799025/error-using-spark-save-does-not-support-bucketing-right-now – David Ahern Mar 27 '19 at 14:02

2 Answers

While Spark (in versions <= 2.4, at least) doesn't directly support Hive's bucketing format, as described here and here, it is possible to get Spark to output bucketed data that is readable by Hive, by using Spark SQL to load the data into Hive. Following your example, it would be something like:

import org.apache.spark.sql.SparkSession

//enable Hive support when creating/configuring the spark session
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

//register DF as a view that can be queried with Spark SQL
DF.createOrReplaceTempView("bucketed_df")

//create the Hive table, can also be done manually on Hive
//(note: Hive's CLUSTERED BY requires parentheses around the column list)
val createTableSQL =
  "CREATE TABLE bucketed_table1 (col1 INT, col2 STRING) " +
  "CLUSTERED BY (col1) INTO 20 BUCKETS STORED AS PARQUET"
spark.sql(createTableSQL)

//load data from DF into Hive, output parquet files will be bucketed and readable by Hive
spark.sql("INSERT INTO bucketed_table1 SELECT * FROM bucketed_df")
jmng

All other DataFrameWriter methods produce tables from which you can subsequently select via the Hive and Impala editors, except that those tables are not Spark-bucketed.

You need to select from Spark-bucketed tables via spark.read. ...
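
For example, a minimal sketch (assuming the table name from the question):

//Spark can read back its own bucketed table, even though Hive cannot
val df = spark.read.table("EMP.bucketed_table1")
df.show()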

This should help: https://spark.apache.org/docs/latest/sql-programming-guide.html

The answer to your question is that it is not currently possible to select from Spark-bucketed tables via Hive or Impala.

thebluephantom
  • If we are creating our dataframe using hive tables in pyspark, how can we create bucketed tables through pyspark, which can be viewed later in hive? – vikrant rana Sep 07 '18 at 01:42
  • Well, my understanding is you cannot. But maybe I am wrong. You can use spark.read for Spark, but from the Hive or IMPALA editor - I took the message "incompatible" to mean exactly that and did not find a way to view. Maybe there is, but the message is pretty explicit. Maybe a custom (de)serializer ... – thebluephantom Sep 07 '18 at 13:49
  • The only other option, imhu(nderstanding), is to df.write.partitionBy to an already created Hive table with dynamic partitioning enabled. But that is a different approach. – thebluephantom Sep 07 '18 at 22:15
  • Yes, we can use spark.read (with the path location of the file) or the sql context to get the data back in pyspark. Is there any way to look at the bucketed tables in hive, which are being created by pyspark using the bucketBy option? – vikrant rana Sep 08 '18 at 14:37
  • I believe the question has been answered, as in currently not possible. – thebluephantom Sep 09 '18 at 13:19
  • There is a post on stackoverflow that says bucketBy tables can be viewed later in hive if we set the below properties. When doing spark-submit you need to add two options to your SparkSession: .config("hive.metastore.uris", "thrift://address-to-your-master:9083") and .enableHiveSupport(). Otherwise the hive tables you created will not be visible. Here is the link: https://stackoverflow.com/questions/48459208/does-spark-know-the-partitioning-key-of-a-dataframe (see the configuration sketch after this thread). Does anyone know if it works?? – vikrant rana Sep 17 '18 at 17:26
  • That's interesting. I did not find that in the latest docs. Will try later and if so withdraw my answer. – thebluephantom Sep 17 '18 at 18:12
  • I am wondering if we are talking at cross purposes. I can read bucketed tables via Spark, no sweat. From the Hive shell or Impala I cannot. – thebluephantom Sep 18 '18 at 10:23
  • Yes, I was also able to read it back in spark but was not able to view it in hive, since the schema was incompatible. – vikrant rana Sep 18 '18 at 12:06
  • so this is a neutral conversation – thebluephantom Sep 18 '18 at 12:23
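
For reference, a minimal sketch of the SparkSession configuration mentioned in the comment thread above; the metastore URI is a placeholder, and whether this makes bucketBy tables readable from the Hive shell is exactly what the thread leaves open:

import org.apache.spark.sql.SparkSession

//point Spark at the Hive metastore and enable Hive support,
//as suggested in the linked question; the URI is a placeholder
val spark = SparkSession.builder()
  .appName("bucketed-write")
  .config("hive.metastore.uris", "thrift://address-to-your-master:9083")
  .enableHiveSupport()
  .getOrCreate()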