
I created a Spark SQL table by calling .saveAsTable on my DataFrame. That command succeeded, but now when I query the table from Hive, the Parquet files appear unreadable. I'm seeing this error:

"Failed with exception java.io.IOException:java.io.IOException: hdfs://ip:8020/user/hive/warehouse/people/part-r-00001.parquet not a SequenceFile"

These are the steps I followed in spark-shell:

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val path = "test.json"
scala> val people = sqlContext.jsonFile(path)
scala> people.saveAsTable("people")

After that, I opened the Hive command prompt:

hive> select * from people;
OK
Failed with exception java.io.IOException:java.io.IOException: hdfs://IP:8020/user/hive/warehouse/people/part-r-00001.parquet not a SequenceFile
Time taken: 0.276 seconds

How can I get results from my Hive table (people), and how can I resolve the above exception? Please let me know if anything needs to change configuration-wise.

Thanks in advance.

Sai
  • try setting `spark.sql.hive.convertMetastoreParquet` to false – Sebastian Piu Jan 19 '16 at 19:36
  • Hi Sebastian, thanks for the reply – Sai Jan 20 '16 at 07:23
  • Hi Sebastian, thanks for the reply. I made the change you suggested, i.e. I added `spark.sql.hive.convertMetastoreParquet false` to my spark-defaults.conf and then restarted my cluster, but I am still getting the same error. Can you please help with this? – Sai Jan 20 '16 at 07:28
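
For reference, the setting suggested in the comments can also be applied at runtime in spark-shell rather than through spark-defaults.conf. A minimal sketch, assuming a HiveContext named sqlContext as in the question (note this only changes how Spark itself reads the table, not how the Hive CLI reads it):

scala> // Tell Spark to use Hive's Parquet SerDe instead of its built-in Parquet support
scala> sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
scala> sqlContext.sql("select * from people").show()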

2 Answers


This may be related to https://issues.apache.org/jira/browse/SPARK-14927.

It seems saveAsTable creates the Hive table in a Spark-specific format. If you see a message like

Persisting partitioned data source relation `XX Table` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s)

then the Spark-specific format is probably the cause.

Instead, you can create the Hive table first with sqlContext.sql("CREATE TABLE ..."), and then write your data into HDFS with df.write.save.
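
A minimal sketch of that approach, continuing the question's spark-shell session (the two-column schema and the HDFS path are assumptions based on the question, not part of the original answer; df.write requires Spark 1.4+, and the broken people table is assumed to have been dropped first):

scala> // Create the table through Hive DDL so it uses Hive's own Parquet SerDe
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS people (name STRING, age INT) STORED AS PARQUET")
scala> // Write plain Parquet files into the table's warehouse directory
scala> people.write.format("parquet").mode("append").save("hdfs://ip:8020/user/hive/warehouse/people")

After that, select * from people in Hive should be able to read the files with its own Parquet SerDe rather than failing on a Spark-specific layout.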

Also see this question, this one, and this blog post.

phil

Tables created with saveAsTable won't work from Hive if Hive and Spark are using different Parquet SerDe versions. You can try a different serialization format instead, e.g.:

df.write.format("orc").saveAsTable("table")

or

df.write.format("json").saveAsTable("table")
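
Applied to the question's session, that would look like the following (people_orc is an illustrative table name; the ORC data source requires a HiveContext, which the question is already using):

scala> people.write.format("orc").saveAsTable("people_orc")

and then in Hive:

hive> select * from people_orc;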

Sebastian Piu