To create a Spark External table you must specify the "path" option of the DataFrameWriter. Something like this:
df.write.
option("path","hdfs://user/zeppelin/my_mytable").
saveAsTable("my_table")
The problem though, is that it will empty your hdfs path hdfs://user/zeppelin/my_mytable
eliminating your existing files and will cause an org.apache.spark.SparkException: Job aborted.
. This looks like a bug in Spark API...
Anyway, the workaround to this (tested in Spark 2.3) is to create an external table but from a Spark DDL. If your table have many columns creating the DDL could be a hassle. Fortunately, starting from Spark 2.0, you could call the DDL SHOW CREATE TABLE
to let spark do the hard work. The problem is that you can actually run the SHOW CREATE TABLE
in a persistent table.
If the table is pretty big, I recommend to get a sample of the table, persist it to another location, and then get the DDL. Something like this:
// Create a sample of the table
val df = spark.read.parquet("hdfs://user/zeppelin/my_table")
df.limit(1).write.
option("path", "/user/zeppelin/my_table_tmp").
saveAsTable("my_table_tmp")
// Now get the DDL, do not truncate output
spark.sql("SHOW CREATE TABLE my_table_tmp").show(1, false)
You are going to get a DDL like:
CREATE TABLE `my_table_tmp` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table_tmp')
Which you would want to change to have the original name of the table and the path to the original data. You can now run the following to create the Spark External table pointing to your existing HDFS data:
spark.sql("""
CREATE TABLE `my_table` (`ID` INT, `Descr` STRING)
USING parquet
OPTIONS (
`serialization.format` '1',
path 'hdfs:///user/zeppelin/my_table')""")