9

I'm using spark-hive 2.3.0 with Scala 2.11 and setting up a unit test framework. spark-hive comes with TestHiveContext and TestHiveSparkSession, which conveniently allow Hive to be invoked from unit tests without a running Hadoop or Spark cluster; this is great for automated tests.

Hive needs a database for its metastore. When run this way, it uses Derby as an embedded database, configured through javax.jdo.option.ConnectionURL, which defaults to jdbc:derby:;databaseName=<file-path>;create=true. The <file-path> is a location on the local filesystem; a file-backed database is one way to run Derby.

Another option is running Derby in-memory, which is usually as easy as changing this URL to something like jdbc:derby:memory:databaseName;create=true. However, this isn't possible with Hive here because the configuration is made in an internal HiveUtils class and can't be overridden. I've tried changing it in my SparkSession builder, but my change later gets blown away by HiveUtils when I create my TestHiveContext.
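For illustration, my attempt looked roughly like this (a sketch; the app name and in-memory URL are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("test")
  .enableHiveSupport()
  // This value is later replaced by HiveUtils when TestHiveContext is created
  .config("javax.jdo.option.ConnectionURL", "jdbc:derby:memory:db;create=true")
  .getOrCreate()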

In my case an in-memory database is preferable because our developers run on Windows (definitely not my/our choice). When these files are created on Windows there are often problems, such as permission errors or invalid characters in filenames (Hadoop was never really intended to work on Windows), and the files often get left behind because those same problems prevent them from being cleaned up. We would like the tests to be completely self-contained, running and finishing with no side effects, so they can be run in multiple environments (developer machines, CI, Jenkins, AWS, etc.).

Interestingly, I see this in TestHive.scala:

{ // set the metastore temporary configuration
  val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false) ++ Map(

So there is a flag for using an in-memory database, but this is not configurable and there is no code path where this is set to true.

Is there any way to configure or write this so that TestHive's Derby runs in-memory? Setting javax.jdo.option.ConnectionURL through either hive-site.xml or hdfs-site.xml does not work. I believe this is because of how TestHive, TestHiveContext, and TestHiveSparkSession are initialized: they have their own code paths, separate from the non-test paths. The functionality they provide is very helpful for a test framework, but apparently there is no way to override this value and some other settings.

The best option I can see so far is to write my own TestHiveContext-like class that borrows a bunch of functionality from the original and overrides the parts I need, but that's a relatively large undertaking for what I think should be a simple configuration change.

Uncle Long Hair
  • What about using the standard way to configure Hive (or any other component of the Hadoop ecosystem that uses Hadoop configuration libs) i.e. creating a `hive-site.xml` config file, and adding the _directory_ containing that file to the CLASSPATH? – Samson Scharfrichter Apr 06 '18 at 20:21
  • For the record, when using the shell launchers e.g. `spark-submit`, the scripts add `$SPARK_CONF_DIR` to the CLASSPATH which means it's the right place to have Log4J and Hive config files. Among other things. – Samson Scharfrichter Apr 06 '18 at 20:23
  • See the 2nd answer to https://stackoverflow.com/questions/38377188/how-to-get-rid-of-derby-log-metastore-db-from-spark-shell – Samson Scharfrichter Apr 06 '18 at 20:26
  • I tried configuring this parameter with hive-site.xml or hdfs-site.xml but that did not work; I added an edit to the original post with details. – Uncle Long Hair Apr 09 '18 at 14:59

1 Answer

6

I finally figured out how to do this, and wanted to share the answer in case someone else is trying to do the same thing.

My test class uses the SharedSparkContext trait, which provides a SparkContext reference via a var named sc.

After the SparkContext is initialized (I used the beforeAll hook available in scalatest), I create a TestHiveContext like this:

// The second argument (loadTestTables) skips loading Spark's built-in test tables
hc = new TestHiveContext(sc, false)

And then immediately afterwards, I can set javax.jdo.option.ConnectionURL (and presumably some other Hadoop and Hive settings) like this:

sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL", 
                           "jdbc:derby:memory:db;create=true")

This config param is used by Hive, but apparently has to be added to the Hadoop configuration, which is used to build the Hive test context.

The trick is the timing. This has to be done after Hadoop, Hive, and scalatest have initialized themselves (using config files and whatnot), and after the TestHive framework has initialized, but before any test has run. Setting this parameter before those other initializations means your setting will be overwritten before your tests run.
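Putting it all together, here is a minimal sketch of the ordering (assuming spark-testing-base's SharedSparkContext and a scalatest FunSuite; the suite name, database name, and test body are illustrative):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.sql.hive.test.TestHiveContext
import org.scalatest.FunSuite

class HiveMetastoreSuite extends FunSuite with SharedSparkContext {
  @transient private var hc: TestHiveContext = _

  override def beforeAll(): Unit = {
    super.beforeAll() // initializes sc
    // Create the test Hive context first; second arg skips the built-in test tables
    hc = new TestHiveContext(sc, false)
    // Only now override the metastore URL; doing it earlier gets clobbered
    sc.hadoopConfiguration.set("javax.jdo.option.ConnectionURL",
                               "jdbc:derby:memory:db;create=true")
  }

  test("metastore works in-memory") {
    hc.sql("CREATE TABLE IF NOT EXISTS t (id INT)")
    assert(hc.sql("SHOW TABLES").collect().nonEmpty)
  }
}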

Can
Uncle Long Hair