
I am new to Scala and the Scala API of Spark, which I recently tried on my own computer; that is, I run Spark locally by setting SparkSession.builder().master("local[*]"). At first I succeeded in reading a text file with spark.sparkContext.textFile(). After getting the corresponding RDD, I tried to convert it to a Spark DataFrame, but failed again and again. Specifically, I tried two methods, 1) toDF() and 2) spark.createDataFrame(); both failed with an error similar to the one shown below.
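Roughly, this is the code I was running (the file path is just a placeholder):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.sql.Row

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-to-df-test")
      .getOrCreate()
    import spark.implicits._ // needed for toDF()

    // reading the file works fine
    val rdd = spark.sparkContext.textFile("/path/to/some.txt")

    // method 1: fails with the Derby error below
    val df1 = rdd.toDF("line")

    // method 2: same error
    val rowRdd = rdd.map(Row(_))
    val schema = StructType(Seq(StructField("line", StringType)))
    val df2 = spark.createDataFrame(rowRdd, schema)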

2018-10-16 21:14:27 ERROR Schema:125 - Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader 
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@199549a5, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)

I examined the error message; it seems the errors are related to Apache Derby, and that a connection to some database is failing. I do not actually know what JDBC is. I am somewhat familiar with PySpark and I have never been asked to configure any JDBC database there, so WHY does Scala-API Spark need it? What should I do to avoid this error? And why does a Scala-API Spark DataFrame need JDBC or any database when a Scala-API Spark RDD doesn't?

Colin
  • When it says "see the next exception for details", do this to learn what's really going on: https://wiki.apache.org/db-derby/UnwindExceptionChain – Bryan Pendleton Oct 16 '18 at 20:09
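The linked page boils down to walking the SQLException chain to find the root cause. A minimal sketch of that idea (getNextException is standard JDBC, so it should apply to Derby's exceptions):

    import java.sql.SQLException

    // Print every exception in the chain, not just the outermost one;
    // with Derby, the real cause is usually a few links down.
    def unwind(e: SQLException): Unit = {
      var cur: SQLException = e
      while (cur != null) {
        println(cur.getMessage)
        cur = cur.getNextException
      }
    }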

1 Answer


For future googlers: I googled for several hours and still had no idea how to get rid of this error, but the origin of the problem is very clear: my SparkSession enables Hive support, which in turn needs a metastore database (that is what the embedded Derby database is for). To solve the problem, we can disable Hive support; since I am running Spark on my own Mac, it is OK to do this. So I downloaded the Spark source and built it myself with ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests, omitting -Phive and -Phive-thriftserver. I tested the self-built Spark: the metastore_db folder is never created, and so far so good.
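If you would rather not rebuild Spark, disabling the Hive catalog per session is reported to achieve the same thing. I have not tested this myself, and spark.sql.catalogImplementation is an internal Spark config, so treat this as an assumption:

    import org.apache.spark.sql.SparkSession

    // Untested alternative: keep the prebuilt binaries and request the
    // in-memory catalog instead of the Hive one, so no Derby-backed
    // metastore_db should be initialised.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()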

For details, please refer to this post: Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

Colin