26

I am trying to use SparkSession to convert the JSON data in a file to an RDD with Spark Notebook. I already have the JSON file.

val spark = SparkSession
  .builder()
  .appName("jsonReaderApp")
  .config("config.key.here", configValueHere)
  .enableHiveSupport()
  .getOrCreate()
val jread = spark.read.json("search-results1.json")

I am very new to Spark and do not know what to use for config.key.here and configValueHere.

Tom
Sha2b
  • Look at this: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html – ROOT Mar 26 '17 at 04:44

5 Answers

53

SparkSession

To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()

or without importing SparkConf:

spark.sparkContext.getConf().getAll()

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
  2. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
  4. https://spark.apache.org/docs/latest/api/R/reference/sparkR.session.html

You can get a deeper-level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure whether you can change these.

spark.sparkContext._conf.getAll()  
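If you only need a single property rather than the whole list, a quick sketch (the keys here are just examples):

# Look up one property on the underlying SparkConf
spark.sparkContext.getConf().get("spark.app.name")

# Or through the session's runtime config; the second argument is a
# default returned when the key is not set.
spark.conf.get("spark.sql.shuffle.partitions", "200")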

SparkContext

To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.

import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
sc = SparkContext(conf=spark_conf)
sc.getConf().getAll()

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
  2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
  4. https://spark.apache.org/docs/latest/api/R/reference/sparkR.init-deprecated.html

Spark parameters

You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:

[(u'spark.eventLog.enabled', u'true'),
 (u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
 ...
 ...
 (u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]

Depending on which API you are using, see one of the following:

  1. https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
  2. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html
  3. https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
  4. https://spark.apache.org/docs/3.3.2/api/R/reference/sparkR.conf.html (for SparkR, sparkConfig can only be set from sparkR.session(sparkConfig=list()))

For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
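If you are looking for a particular group of settings in that list, a small filter over the tuples works (a sketch; the substring is just an example):

# Keep only the properties whose key mentions "yarn"
conf_pairs = spark.sparkContext.getConf().getAll()
[(k, v) for (k, v) in conf_pairs if "yarn" in k]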

Setting Spark parameters

Each tuple is ("spark.some.config.option", "some-value"), and you can set these in your application with:

SparkSession

spark = (
    SparkSession
    .builder
    .appName("Your App Name")
    .config("spark.some.config.option1", "some-value")
    .config("spark.some.config.option2", "some-value")
    .getOrCreate())

sc = spark.sparkContext

SparkContext

spark_conf = (
    SparkConf()
    .setAppName("Your App Name")
    .set("spark.some.config.option1", "some-value")
    .set("spark.some.config.option2", "some-value"))

sc = SparkContext(conf = spark_conf)

spark-defaults

You can also set the Spark parameters in a spark-defaults.conf file:

spark.some.config.option1 some-value
spark.some.config.option2 some-value

then run your Spark application with spark-submit (here a PySpark application):

spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
path/to/your/pyspark_main.py
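Note that individual properties can also be passed on the spark-submit command line with --conf spark.some.config.option1=some-value. Per the Spark configuration docs, properties set directly on a SparkConf (or the SparkSession builder) take the highest precedence, then flags passed to spark-submit, then values from spark-defaults.conf.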
Clay
  • For SparkSession, it seems `spark.sparkContext.getConf().getAll()` provides more info than `SparkConf().getAll()`. – flow2k Jul 03 '19 at 18:55
7

This is how I add Spark or Hive settings in my Scala code:

val spark = SparkSession
    .builder()
    .appName("StructStreaming")
    .master("yarn")
    .config("hive.merge.mapfiles", "false")
    .config("hive.merge.tezfiles", "false")
    .config("parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("hive.merge.smallfiles.avgsize", "160000000")
    .enableHiveSupport()
    .config("hive.exec.dynamic.partition", "true")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.sql.orc.impl", "native")
    .config("spark.sql.parquet.binaryAsString", "true")
    .config("spark.sql.parquet.writeLegacyFormat", "true")
    //.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
    .getOrCreate()
Jeff A.
4

The easiest way to set some config:

spark.conf.set("spark.sql.shuffle.partitions", 500)

where spark refers to a SparkSession; that way you can set configs at runtime. This is really useful when you want to change configs repeatedly to tune some Spark parameters for specific queries.
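A small PySpark sketch of this (the property names are just examples) that also reads the value back to confirm it took effect:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set a SQL property at runtime ...
spark.conf.set("spark.sql.shuffle.partitions", "500")

# ... and read it back; the optional second argument is a default
# returned when the key is not set.
print(spark.conf.get("spark.sql.shuffle.partitions"))         # '500'
print(spark.conf.get("spark.nonexistent.option", "not set"))  # 'not set'

Note that only runtime-changeable properties (mostly spark.sql.*) can be modified this way; most core Spark settings are fixed once the SparkContext has started.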

kar09
1

In simple terms, values set via the config method are automatically propagated to both the SparkConf and the SparkSession's own configuration.

For example, you can refer to https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how the Hive warehouse location is set for a SparkSession using the config option.

To learn more about this API, you can refer to https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
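A rough PySpark sketch of that propagation (assuming no SparkSession already exists in the process; the app name and value are placeholders): an option passed to the builder is visible both through the session's runtime config and through the SparkConf of the underlying SparkContext.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-propagation-demo")           # placeholder name
    .config("spark.sql.shuffle.partitions", "8")  # any key-value pair
    .getOrCreate()
)

# Visible through the SparkSession's runtime configuration ...
print(spark.conf.get("spark.sql.shuffle.partitions"))
# ... and through the SparkConf of the underlying SparkContext.
print(spark.sparkContext.getConf().get("spark.sql.shuffle.partitions", "not set"))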

Sriram
  • Thank you very much. But my question is mainly about setting up the SparkSession to read json files. – Sha2b Mar 26 '17 at 17:54
  • Ok, here is the setup I used, `val spark = SparkSession .builder() .appName("jsonReaderApp") .config("spark.sql.json.rdd2", 2) .getOrCreate()` and `val jread = spark.read.json("bin/flatjson.json")` The only issue is I cannot use the JSON data I get from the stream. I have to change the formatting of the file; each line must be one object and there should be no commas between objects. – Sha2b Mar 27 '17 at 02:16
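On the formatting issue mentioned in the comment above: spark.read.json expects JSON Lines by default (one JSON object per line, no enclosing array or commas between objects). Spark 2.2 and later also offer a multiLine option for files that contain a single multi-line JSON document or a top-level array; a hedged sketch:

# Default: JSON Lines, one object per line
df_lines = spark.read.json("search-results1.json")

# Spark 2.2+: a file holding one multi-line JSON document or array
df_multi = spark.read.option("multiLine", "true").json("search-results1.json")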
0

Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html

You can set these at runtime, as in your example above, or through the config file given to spark-submit.

Anil