
I am new to Spark and relatively new to Linux in general. I am running Spark on local Ubuntu in client mode, on a machine with 16 GB of RAM. I installed Apache Spark following this link, and I am able to run and process large volumes of data. The challenge is exporting the resulting data frames to CSV: with even 100k rows of data I get all sorts of memory issues. In contrast, I was able to process partitioned Python files totaling several million rows.

Based on lots of googling, I believe the problem lies with my spark.driver.memory. I need to change this, but since I am running in client mode I should change it in some configuration file. How can I check whether I have an existing Spark configuration file, and how do I create a new one and set spark.driver.memory to 2 GB?

itthrill

1 Answer


You can change the default value for all sessions in

$SPARK_HOME/spark-defaults.conf

If you do not find spark-defaults.conf, you should have a file spark-defaults.conf.template; just copy it (cp spark-defaults.conf.template spark-defaults.conf) and edit it, uncommenting the line:

# spark.driver.memory              5g
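
For example, after copying the template and uncommenting that line (using the 2 GB figure from the question; pick whatever your machine allows), the entry in spark-defaults.conf would simply read:

spark.driver.memory              2g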

Alternatively, you can set the value just for the current session using .config in the session builder:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("myApp") \
    .config("spark.driver.memory", "5g") \
    .getOrCreate()

(you might also want to increase spark.executor.memory)
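
As a quick sanity check (a sketch of mine, not part of the original answer), you can read the effective values back from the running session; note that driver memory is fixed when the driver JVM starts, so if a session already exists, getOrCreate() returns it with its original settings:

from pyspark.sql import SparkSession

# Reuse the session created above; getOrCreate() returns the existing one if it is running.
spark = SparkSession.builder.getOrCreate()

# These values were fixed when the driver JVM started (from spark-defaults.conf or .config()).
print(spark.sparkContext.getConf().get("spark.driver.memory"))               # e.g. "5g"
print(spark.sparkContext.getConf().get("spark.executor.memory", "not set"))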

See also my other answer to a similar question.

user2314737
  • I could not find any value in $SPARK_HOME; for example, `(py_env) user@laptop:~$ echo "$SPARK_HOME"` returns nothing. Q1) Can I export a path to $SPARK_HOME? – itthrill May 08 '22 at 14:34
  • Q2) Does it need to be a specific path? Q3) I could not find spark-defaults.conf.template or spark-defaults.conf anywhere on the machine. Can I download a sample spark-defaults.conf.template from the web? – itthrill May 08 '22 at 14:36
  • `(py_env) user@laptop:~$ find -name spark` returns `./anaconda3/envs/py_env/lib/python3.9/site-packages/pyspark/pandas/spark` and `./anaconda3/pkgs/pyspark-3.2.1-pyhd8ed1ab_0/site-packages/pyspark/pandas/spark` – itthrill May 08 '22 at 14:38
  • try a global search `find / -name "spark-defaults*" 2>/dev/null` (it might take long) – user2314737 May 08 '22 at 14:54
  • :( I ran the above command; it ran for a while and then returned nothing. – itthrill May 08 '22 at 15:00
  • How are you running Spark jobs? In a spark-shell? In Python with `import pyspark`? Please provide a [mcve] (as much as possible) – user2314737 May 08 '22 at 15:02
  • I installed pyspark using Anaconda rather than downloading it directly from the Apache site. I have just removed the pyspark packages from Anaconda and installed from the Apache site. I hope I can figure the rest out, as all the suggestions assume that pyspark is installed by downloading it from the Apache site. – itthrill May 08 '22 at 15:47
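
Following up on the comments: with a pip/conda install of pyspark there is typically no separate Spark distribution, so $SPARK_HOME is unset and no spark-defaults.conf exists. A minimal sketch (my assumption based on how Spark resolves its configuration directory, not something confirmed in the thread) for locating the package directory, which acts as SPARK_HOME, and the place where a spark-defaults.conf could be created:

import os
import pyspark

# For pip/conda installs, the installed pyspark package directory plays the role of SPARK_HOME.
spark_home = os.path.dirname(pyspark.__file__)
print(spark_home)

# A conf/ subdirectory can be created here to hold spark-defaults.conf
# (assumption: Spark reads $SPARK_HOME/conf/spark-defaults.conf when SPARK_CONF_DIR is not set).
print(os.path.join(spark_home, "conf", "spark-defaults.conf"))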