25

When running spark-shell it creates a file derby.log and a folder metastore_db. How do I configure spark to put these somewhere else?

For the derby log, I've tried Getting rid of derby.log like so: `spark-shell --driver-memory 10g --conf "-spark.driver.extraJavaOptions=Dderby.stream.info.file=/dev/null"`, with a couple of different properties, but spark ignores them.

Does anyone know how to get rid of these or specify a default directory for them?

Carlos Bribiescas
  • You typed `derby.stream.info.file` in your question's text. The question you linked to, http://stackoverflow.com/questions/1004327/getting-rid-of-derby-log says to configure `derby.stream.error.file`. Which one did you actually try? – Bryan Pendleton Nov 09 '16 at 11:23

7 Answers

17

The use of the hive.metastore.warehouse.dir is deprecated since Spark 2.0.0, see the docs.

As hinted by this answer, the real culprit for both the metastore_db directory and the derby.log file being created in every working subdirectory is the derby.system.home property defaulting to ..

Thus, a default location for both can be specified by adding the following line to spark-defaults.conf:

spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby

where /tmp/derby can be replaced by the directory of your choice.
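If you would rather script the change than edit the file by hand, it amounts to appending that single line to `spark-defaults.conf`. A minimal sketch (it writes to a local demo copy; point `conf_dir` at your real `$SPARK_HOME/conf` instead):

```python
import os

# Demo directory standing in for $SPARK_HOME/conf; adjust to your install.
conf_dir = "demo-conf"
os.makedirs(conf_dir, exist_ok=True)

# The single line that relocates both metastore_db and derby.log.
line = "spark.driver.extraJavaOptions -Dderby.system.home=/tmp/derby\n"
with open(os.path.join(conf_dir, "spark-defaults.conf"), "a") as f:
    f.write(line)
```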

hiryu
  • This doesn't seem to work anymore. Using spark 2.2 :( – Moriarty Snarly Nov 17 '17 at 14:51
  • I tested on a new installation of Spark 2.2.1 and it's working for me. Maybe it has to do with permissions? – hiryu Jan 09 '18 at 13:53
  • I tried with `spark.config("spark.driver.extraJavaOptions", "-Dderby.system.home=D:\\tmp\\derby")` using Spark 2.2.0 and it didn't work. – Adrien Brunelat Apr 01 '20 at 16:46
  • It seems that you are trying to change the configuration _after_ launching the Spark context. But by then it's too late for this setting... You need to change the Spark default configuration in the `spark-defaults.conf` file as explained above... – hiryu Apr 02 '20 at 12:42
16

For spark-shell, to avoid the metastore_db directory without doing it in code (since by then the context/session is already created, and you won't stop it and recreate it with the new configuration each time), you have to set its location in a hive-site.xml file and copy that file into Spark's conf directory.
A sample hive-site.xml file that places metastore_db in /tmp (refer to my answer here):

<configuration>
   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:derby:;databaseName=/tmp/metastore_db;create=true</value>
     <description>JDBC connect string for a JDBC metastore</description>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>org.apache.derby.jdbc.EmbeddedDriver</value>
     <description>Driver class name for a JDBC metastore</description>
   </property>
   <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/tmp/</value>
      <description>location of default database for the warehouse</description>
   </property>
</configuration>

After that, you can start your spark-shell as follows to get rid of derby.log as well:

$ spark-shell --conf "spark.driver.extraJavaOptions=-Dderby.stream.error.file=/tmp"
user1314742
  • How to disable derby entirely in standalone. A lot of the old methods (postgres setup + db creation + hive-site.xml appear to be no longer working on spark 2.2) – mathtick Feb 27 '18 at 22:31
4

Try setting derby.system.home to some other directory as a system property before firing up the spark shell. Derby will create new databases there. The default value for this property is `.` (the current working directory).

Reference: https://db.apache.org/derby/integrate/plugin_help/properties.html
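One way to pass that system property is on the command line when launching the shell. A sketch of doing so from Python (it assumes spark-shell is on your PATH; /tmp/derby is an example location):

```python
import os
import shutil
import subprocess

# Example target directory for Derby's databases and log.
derby_home = "/tmp/derby"
os.makedirs(derby_home, exist_ok=True)

cmd = [
    "spark-shell",
    "--conf",
    f"spark.driver.extraJavaOptions=-Dderby.system.home={derby_home}",
]
# Only launch if spark-shell is actually installed.
if shutil.which("spark-shell"):
    subprocess.run(cmd)
```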

BillRobertson42
3

For me, setting the Spark property didn't work, on either the driver or the executor. So, searching for this issue, I ended up setting the property for my system instead with:

import org.apache.spark.sql.SparkSession

// Point Derby at a dedicated directory before the session is created
System.setProperty("derby.system.home", "D:\\tmp\\derby")

val spark: SparkSession = SparkSession.builder
    .appName("UT session")
    .master("local[*]")
    .enableHiveSupport()
    .getOrCreate()

[...]

And that finally got rid of those annoying items.

Adrien Brunelat
2

Use the `spark.sql.warehouse.dir` property (the replacement for the deprecated `hive.metastore.warehouse.dir`). From the docs:

import java.io.File
import org.apache.spark.sql.SparkSession

// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

For the derby log: Getting rid of derby.log could be the answer. In general, create a derby.properties file in your working directory with the following content:

derby.stream.error.file=/path/to/desired/log/file
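A minimal sketch of creating that file programmatically (the log path below is an example destination):

```python
# Write derby.properties into the current working directory; the embedded
# Derby engine reads it at startup and redirects its log accordingly.
log_path = "/tmp/derby.log"  # example destination
with open("derby.properties", "w") as f:
    f.write(f"derby.stream.error.file={log_path}\n")
```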
1

If you are using Jupyter/JupyterHub/JupyterLab, or just setting this conf parameter inside Python, the following will work:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
    .setMaster("local[*]")
    .set('spark.driver.extraJavaOptions','-Dderby.system.home=/tmp/derby')
   )

sc = SparkContext(conf = conf)
kennyut
0

I used the configuration below for a PySpark project. I was able to set up the Spark warehouse db and the Derby db under a chosen path, and so avoided them being created in the current directory.

from pyspark.sql import SparkSession
from os.path import abspath

# Path where you want the Spark warehouse to live (raw string for Windows backslashes)
location = abspath(r"C:\self\demo_dbx\data\spark-warehouse")

local_spark = SparkSession.builder \
                .master("local[*]") \
                .appName('Spark_Dbx_Session') \
                .config("spark.sql.warehouse.dir", location)\
                .config("spark.driver.extraJavaOptions",
                        f"-Dderby.system.home={location}")\
                .getOrCreate()