How to create an EXTERNAL Spark table from data in HDFS

Question

I have loaded a parquet table from HDFS into a DataFrame:

val df = spark.read.parquet("hdfs://user/zeppelin/my_table")

I now want to expose this table to Spark SQL but this must be a persitent table because I want to access it from a JDBC connection or other Spark Sessions.

Quick way could be to call df.write.saveAsTable method, but in this case it will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore, creating another copy of the data in HDFS.

I don't want to have two copies of the same data, so I would want create like an external table to point to existing data.

score 3 · Answer 1 · answered Apr 26 '18 at 18:46

To create a Spark External table you must specify the "path" option of the DataFrameWriter. Something like this:

df.write.
  option("path","hdfs://user/zeppelin/my_mytable").
  saveAsTable("my_table")

The problem though, is that it will empty your hdfs path hdfs://user/zeppelin/my_mytable eliminating your existing files and will cause an org.apache.spark.SparkException: Job aborted.. This looks like a bug in Spark API...

Anyway, the workaround to this (tested in Spark 2.3) is to create an external table but from a Spark DDL. If your table have many columns creating the DDL could be a hassle. Fortunately, starting from Spark 2.0, you could call the DDL SHOW CREATE TABLE to let spark do the hard work. The problem is that you can actually run the SHOW CREATE TABLE in a persistent table.

If the table is pretty big, I recommend to get a sample of the table, persist it to another location, and then get the DDL. Something like this:

// Create a sample of the table 
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
    option("path", "/user/zeppelin/my_table_tmp").
    saveAsTable("my_table_tmp")

// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)

You are going to get a DDL like:

CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
  `serialization.format` '1',
  path 'hdfs:///user/zeppelin/my_table_tmp')

Which you would want to change to have the original name of the table and the path to the original data. You can now run the following to create the Spark External table pointing to your existing HDFS data:

spark.sql("""
  CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
  USING parquet
  OPTIONS (
    `serialization.format` '1',
    path 'hdfs:///user/zeppelin/my_table')""")

When I tried to `specify "path" option` (in `Overwrite` mode), I didn't get any `SparkException`. But the schema of resulting `Hive` `EXTERNAL` table was incorrect: it collapsed all columns into `col array` — y2k-shubham, Nov 14 '18 at 07:52
@y2k-shubham are you using parquet format or csv? If you are using a CSV it is probably a problem with the field separator — Gabriel Avellaneda, Nov 15 '18 at 14:40
**@Gabriel Avellaneda** I was outputting it in `ORC` format. Looks like it [wasn't just me](https://stackoverflow.com/questions/49201436/). By the way, the straightforward solution was to run a `Hive` `DROP TABLE..` command followed by `CREATE EXTERNAL TABLE..` with proper schema. — y2k-shubham, Nov 16 '18 at 06:15

How to create an EXTERNAL Spark table from data in HDFS

1 Answers1