I'm using spark-hive 2.3.0 with Scala 2.11 and setting up a unit test framework. spark-hive comes with `TestHiveContext` and `TestHiveSparkSession`, which conveniently allow invoking Hive from unit tests without Hadoop, Spark, or a cluster running, which is great for automated tests.

Hive needs a database for its metastore, and when run this way it uses Derby as an embedded database, configured with `javax.jdo.option.ConnectionURL`, which by default is `jdbc:derby:;databaseName=<file-path>;create=true`. The `<file-path>` is a location in the local filesystem and is one way of running Derby. Another is running Derby in-memory, which is usually as easy as changing this URL to something like `jdbc:derby:memory:databaseName;create=true`. However, this isn't possible with Hive here, because the configuration is made in an internal `HiveUtils` class and can't be overridden. I've tried changing it in my SparkSession builder, but my change later gets blown away by `HiveUtils` when I create my `TestHiveContext`.
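Roughly what that attempt looks like (the database name and app name below are placeholders I made up; the property key and URL form are the ones described above):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.test.TestHiveContext

// Sketch of what I tried: set the in-memory Derby URL before anything Hive-related starts.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-unit-tests") // placeholder
  .config("javax.jdo.option.ConnectionURL",
    "jdbc:derby:memory:metastore_db;create=true")
  .enableHiveSupport()
  .getOrCreate()

// Creating the test context re-applies HiveUtils' temporary metastore configuration,
// which points Derby back at a file in the local filesystem.
val testHive = new TestHiveContext(spark.sparkContext)
```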
In my case an in-memory database is preferable because our developers run on Windows (definitely not my/our choice). When these files are created there are often problems such as permission errors or invalid characters in the file path (since Hadoop was never really intended to work on Windows), and because of those problems the files often can't be cleaned up and get left behind. We would like the tests to be completely self-contained, running and finishing with no side effects, so they can be run in any environment (developer machine, CI, Jenkins, AWS, etc.).
Interestingly, I see this in `TestHive.scala`:

```scala
{ // set the metastore temporary configuration
  val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = false) ++ Map(
```
So there is a flag for using an in-memory database, but it is not configurable and there is no code path where it is set to `true`.
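Just for illustration, here is that same call with the flag flipped, which is what I would want to happen; this is Spark's internal code quoted above, not something I can reach or change from my test framework, and I've dropped the trailing `++ Map(...)`:

```scala
// Hypothetical: the flag set to true. No configuration option or code path in
// TestHive actually does this today.
val metastoreTempConf = HiveUtils.newTemporaryConfiguration(useInMemoryDerby = true)
```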
Is there any way to configure or write this so that `TestHive`'s Derby can be in-memory? Trying to set the value of `javax.jdo.option.ConnectionURL` through either hive-site.xml or hdfs-site.xml does not work. I think that is because of how `TestHive`, `TestHiveContext`, and `TestHiveSparkSession` are initialized: they have their own code paths separate from the non-test paths. The functionality they provide is very helpful for a test framework, but apparently they don't offer a way to override this value and some other settings.
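For reference, the programmatic equivalent of those XML attempts looks like this (illustrative only; I would expect it to be clobbered in exactly the same way, since the test session applies its own temporary metastore configuration later):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: the same property the XML files would carry, set on the
// Hadoop configuration before the test context is built.
val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("hive-unit-tests")) // placeholder name
sc.hadoopConfiguration.set(
  "javax.jdo.option.ConnectionURL",
  "jdbc:derby:memory:metastore_db;create=true")
```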
The best option I can see so far is to override or write my own `TestHiveContext` class that borrows a bunch of functionality from the original and overrides the parts I need, but that's a relatively large undertaking for something I think could be done with a simple configuration change.
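Very roughly, the starting shape of that workaround might be the following sketch (all names here are mine, and whether a plain Hive-enabled session honours this URL at all is one of the things I would have to verify); the real work would be re-creating the test-table loading and reset conveniences that `TestHive` provides on top of it:

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Sketch of a hand-rolled replacement for TestHiveContext: own the session and the
// metastore configuration instead of letting HiveUtils pick a file-based Derby.
object InMemoryMetastoreTestSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("in-memory-metastore-tests") // placeholder
    // A scratch warehouse directory still has to exist somewhere on disk.
    .config("spark.sql.warehouse.dir",
      Files.createTempDirectory("spark-warehouse").toString)
    // The hope: with TestHive out of the picture, this URL is not overwritten.
    // Unverified; the metastore may still need the spark.hadoop.-prefixed key instead.
    .config("javax.jdo.option.ConnectionURL",
      "jdbc:derby:memory:metastore_db;create=true")
    .enableHiveSupport()
    .getOrCreate()
}
```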