0

I am using pyspark on Jupyter Notebook with Spark 2.1.1 cluster running on IP "spark://remote:port" (spark master IP) I am able to create a SparkContext successfully.

But, I want to read spark_master_ip and spark.cores.max from a .properties file (instead of hard coding it). When I try to read my custom spark properties file in 'myspark_config.properties' file (which I parse and read successfully), but I get the following Java gateway exception, when I try to create SparkContext(). Here is my code:

import pyspark
from pprint import pprint
from pyspark import SparkConf
def getproperties():
    """Get Spark configuration properties in python dictionary"""
    global properties
    properties = dict()
    with open('myspark_config.properties') as f:
        for line in f:
                if not line.startswith('#') and not line.startswith('\n'):
                    tokens = line.split('=')
                    tokens[0] = tokens[0].strip()
                    tokens[1] = "=".join(tokens[1:])
                    properties[tokens[0]] = tokens[1].strip()

        f.close()
    pprint(properties)
    return(properties)
properties = getproperties()

conf = (SparkConf()
            .setMaster(properties["spark_master_url"])
            .setAppName("testApp")
            .set('spark.cores.max',properties["spark_app_cores"])
            .set('spark.executor.memory',properties["spark_app_memory"])
            .set('spark.dynamicAllocation.enabled','true')
            .set('spark.shuffle.service.enabled','true')

            )
# conf = (SparkConf()
#             .setMaster("spark://remote:port")
#             .setAppName("testApp")
#             .set('spark.cores.max',"2")
#             .set('spark.executor.memory',"2G")
#             .set('spark.dynamicAllocation.enabled','true')
#             .set('spark.shuffle.service.enabled','true')
#         )
sc = pyspark.SparkContext(conf=conf)

I do not get any exceptions and my code runs smoothly if I don't read from file and hard code the spark master in SparkConf() (currently it is commented). "JAVA_HOME","SPARK_HOME" "PYTHONPATH" are set successfully and I am not using Anaconda. I am using Jupyter notebook on python 2.7 on ubuntu and spark-2.1.1

{'spark_app_cores': '"2"',
 'spark_app_memory': '"2G"',
 'spark_master_url': '"spark://remote:port"'}

Exception                                 Traceback (most recent call last)
<ipython-input-1-c893eaf079f2> in <module>()
     36 #             .set('spark.shuffle.service.enabled','true')
     37 #         )
---> 38 sc = pyspark.SparkContext(conf=conf)

/usr/local/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    113         """
    114         self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    116         try:
    117             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    281         with SparkContext._lock:
    282             if not SparkContext._gateway:
--> 283                 SparkContext._gateway = gateway or launch_gateway(conf)
    284                 SparkContext._jvm = SparkContext._gateway.jvm
    285 

/usr/local/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
     93                 callback_socket.close()
     94         if gateway_port is None:
---> 95             raise Exception("Java gateway process exited before sending the driver its port number")
     96 
     97         # In Windows, ensure the Java child processes do not linger after Python has exited.

Exception: Java gateway process exited before sending the driver its port number

I have looked through various links but found no solution: pyspark java gateway error stackoverflow Github issue on java Gateway

Sangram Gaikwad
  • 764
  • 11
  • 21

1 Answers1

0

But, I want to read spark_master_ip and spark.cores.max from a .properties file (instead of hard coding it).

That is awesome idea, but you're ignoring the fact, that this is what $SPARK_HOME/conf/spark-defaults.conf is for. Just put required properties there.

but I get the following Java gateway exception,

This looks wrong:

"=".join(tokens[1:])

Why would use = in properties?

Otherwise it should have no effect. Also Python provides property parser https://docs.python.org/3/library/configparser.html

  • I need my own properties file to set properties specific to my application (like api_key to call a REST API and table_names of databases). Example api_key = "shajkhsakj=sdi23" this is one entry in properties file. I am not allowed to use third party libraries like configparser or jproperties. But my main issue is why do I get an error if I try to open a file in python with open() command, before initializing SparkContext(). – Sangram Gaikwad Oct 11 '17 at 12:03