
I am having trouble reading a Spark dataframe from a Hive table. I stored the dataframe as:

dataframe.coalesce(n_files).write.option("mergeSchema", "true").mode("overwrite").parquet(table_path)

When I try to read this dataframe and do a .show() on it, it breaks with the following error:

java.lang.UnsupportedOperationException: parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
at parquet.column.Dictionary.decodeToLong(Dictionary.java:52)

How can I find which column is the root cause of this error? I tried to follow the answer here, but I can load the df perfectly fine by reading the parquet files directly:

df = spark.read.option("mergeSchema", "True").parquet("/hdfs path to parquets")
  • The Hive table in question is an external table. My guess is that it has something to do with the table properties, but what should I be looking at?
  • I cannot use saveAsTable; I need to write directly to the path due to a certain requirement.

1 Answer


Found the root cause of my problem. Posting my findings here so someone in need can check if their case is the same.

I encountered this issue because of a datatype mismatch between the Hive table metadata and the parquet files. When you do a saveAsTable, Spark typecasts your data while saving if there is any difference. But when you do df.write.parquet(path), you write the parquet files directly to the path, so if the table metadata and the parquet files disagree, df.show() throws an error.

For example, if your table metadata declares column A as bigint, but the df you're trying to save has dtype IntegerType for that column (instead of LongType, which is the correct counterpart of bigint), saveAsTable would cast IntegerType to LongType, but df.write.parquet(path) won't.
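To find which column is the offending one, one option is to diff the table metadata against the schema of the parquet files before writing. A minimal sketch, assuming a hypothetical table name db.my_table; df is the dataframe loaded from the parquet path as in the question:

table_fields = {f.name: f.dataType for f in spark.table("db.my_table").schema}
df_fields = {f.name: f.dataType for f in df.schema}

# Report every column whose dtype in the parquet data differs from the
# dtype recorded in the Hive table metadata.
for name, dtype in df_fields.items():
    if name in table_fields and table_fields[name] != dtype:
        print(f"{name}: table metadata has {table_fields[name]}, parquet data has {dtype}")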

The solution is to cast the problematic column to the dtype that matches the table metadata.
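For instance, reusing the write call from the question, a sketch of the fix (column A and bigint are taken from the example above):

from pyspark.sql.functions import col

# Cast the mismatched column to the type recorded in the table metadata,
# then write the parquet files exactly as before.
dataframe = dataframe.withColumn("A", col("A").cast("bigint"))
dataframe.coalesce(n_files).write.option("mergeSchema", "true").mode("overwrite").parquet(table_path)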
