
I'm using PySpark from a Jupyter notebook and attempting to write a large Parquet dataset to S3.
I get a 'no space left on device' error. I searched around and learned that it's because /tmp is filling up.
I now want to set spark.local.dir to point to a directory that has enough space.
How can I set this parameter?
Most solutions I found suggest setting it via spark-submit. However, I am not using spark-submit; I'm just running the code as a script from Jupyter.

Edit: I'm using Sparkmagic to work with an EMR backend. I think spark.local.dir needs to be set in the config JSON, but I am not sure how to specify it there.
I tried adding it in session_configs but it didn't work.
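For reference, this is roughly what I added to ~/.sparkmagic/config.json (the path is only a placeholder, and I'm not certain a Spark property belongs in this spot at all):

"session_configs": {
  "conf": {
    "spark.local.dir": "/mnt/tmp"
  }
}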

c3p0

1 Answer


The answer depends on where your SparkContext comes from.

If you are starting Jupyter with pyspark:

PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark

then your SparkContext is already initialized when you receive your Python kernel in Jupyter. You should therefore pass a parameter to pyspark (at the end of the command above): --conf spark.local.dir=...
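For example, the full launch command might look like this (the directory is a placeholder for a path with enough free space):

PYSPARK_DRIVER_PYTHON='jupyter' \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
PYSPARK_PYTHON="python" \
pyspark --conf spark.local.dir=/path/with/space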

If you are constructing a SparkContext in Python

If you have code in your notebook like:

import pyspark
sc = pyspark.SparkContext()

then you can configure the Spark context before creating it:

import pyspark

# Set spark.local.dir before the context is created;
# replace '...' with a directory that has enough free space
conf = pyspark.SparkConf()
conf.set('spark.local.dir', '...')
sc = pyspark.SparkContext(conf=conf)
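Once the context exists, you can sanity-check that the setting was picked up (this just reads the value back from the context's SparkConf):

print(sc.getConf().get('spark.local.dir'))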

Configuring Spark via spark-defaults.conf:

It's also possible to configure Spark by editing its defaults file from the shell. The file you want to edit is ${SPARK_HOME}/conf/spark-defaults.conf. You can append to it as follows (creating it if it doesn't exist):

echo 'spark.local.dir /foo/bar' >> ${SPARK_HOME}/conf/spark-defaults.conf
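Related, though a different mechanism: Spark also honors the SPARK_LOCAL_DIRS environment variable, which overrides spark.local.dir for processes started on that machine, so it's worth checking if the property above seems to be ignored (the path is again a placeholder):

export SPARK_LOCAL_DIRS=/foo/bar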
Tim
  • Sorry, should have clarified. I'm using Sparkmagic to connect to an EMR cluster. I'll update the question. – c3p0 Jun 29 '18 at 06:39
  • setting `PYSPARK_DRIVER_PYTHON='jupyter'` is a really **bad** practice - see [here](https://stackoverflow.com/questions/47824131/configuring-spark-to-work-with-jupyter-notebook-and-anaconda/47870277#47870277) for the proper way to use Jupyter with Pyspark – desertnaut Jun 30 '18 at 10:27
  • I tried setting the configuration with conf.set('spark.local.dir','/mymountedspace'), but it is throwing an error. ERROR:root:Exception while sending command. py4j.protocol.Py4JNetworkError: Answer from Java side is empty Py4JError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext – Raj006 Jan 27 '19 at 01:26
  • Nevermind the error. It was due to a permissions issue. Once I changed the ownership to my account, it worked. – Raj006 Jan 27 '19 at 05:29