I built the jupyter/pyspark-notebook Docker image, installed geomesa_pyspark, and tried to run the following example commands from the official guide.

import geomesa_pyspark

conf = geomesa_pyspark.configure(
    jars=['/usr/local/spark/jars/geomesa-accumulo-spark-runtime_2.11-2.0.0.jar'],
    packages=['geomesa_pyspark','pytz'],
    spark_home='/usr/local/spark/').\
    setAppName('MyTestApp')

conf.get('spark.master')

from pyspark.sql import SparkSession

spark = ( SparkSession
    .builder
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)

The same problem persists: creating the SparkSession fails with the following exception.

Exception                                 Traceback (most recent call last)
<ipython-input-4-eca73e557583> in <module>
     22 from pyspark.sql import SparkSession
     23 
---> 24 spark = ( SparkSession
     25     .builder
     26     .config(conf=conf)

/usr/local/spark/python/pyspark/sql/session.py in getOrCreate(self)
    226                             sparkConf.set(key, value)
    227                         # This SparkContext may be an existing one.
--> 228                         sc = SparkContext.getOrCreate(sparkConf)
    229                     # Do not update `SparkConf` for existing `SparkContext`, as it's shared
    230                     # by all sessions.

/usr/local/spark/python/pyspark/context.py in getOrCreate(cls, conf)
    382         with SparkContext._lock:
    383             if SparkContext._active_spark_context is None:
--> 384                 SparkContext(conf=conf or SparkConf())
    385             return SparkContext._active_spark_context
    386 

/usr/local/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    142                 " is not allowed as it is a security risk.")
    143 
--> 144         SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
    145         try:
    146             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,

/usr/local/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
    329         with SparkContext._lock:
    330             if not SparkContext._gateway:
--> 331                 SparkContext._gateway = gateway or launch_gateway(conf)
    332                 SparkContext._jvm = SparkContext._gateway.jvm
    333 

/usr/local/spark/python/pyspark/java_gateway.py in launch_gateway(conf, popen_kwargs)
    106 
    107             if not os.path.isfile(conn_info_file):
--> 108                 raise Exception("Java gateway process exited before sending its port number")
    109 
    110             with open(conn_info_file, "rb") as info:

Exception: Java gateway process exited before sending its port number

In the container logs on Portainer, however, I see this exception:

Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:631)
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:271)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:234)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:119)
    at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.<init>(SparkSubmit.scala:1013)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1013)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I think JAVA_HOME is set correctly (JAVA_HOME="/usr/lib/jvm/java-1.11.0-openjdk-amd64"). From the logs I guess that HADOOP_HOME should also be set, but although the Docker image specifications list an installed Hadoop version, I can't find Hadoop anywhere when I run the container. One odd thing: if I run the same commands from the pyspark shell instead of from the notebook, they complete without errors.
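
One thing I could try (a sketch, assuming the notebook is not actually meant to submit to YARN) is to override whatever default in the image is setting the master to 'yarn' before building the session:

# Reuses the conf object from above. Assumption: no YARN cluster is intended,
# so force a local master instead of the 'yarn' default the image picks up;
# this sidesteps the HADOOP_CONF_DIR/YARN_CONF_DIR requirement entirely.
print(conf.get('spark.master'))
conf.set('spark.master', 'local[*]')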

– Luigi

1 Answer

You don't say which versions of Spark and GeoMesa you installed, but at a minimum you will need to:

  1. Align your Spark version with one that GeoMesa supports
  2. Ensure that the Scala version is the same between GeoMesa and Spark (probably _2.12)
  3. Ensure that the jars path in your code points to the correct GeoMesa jar for the version you installed (see the sketch below)
  4. Use JDK 8, which GeoMesa currently requires
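
For illustration only, a version-aligned configure call might look like the following; the jar filename is hypothetical and must be replaced with the runtime jar actually shipped with your GeoMesa install:

import geomesa_pyspark

# Hypothetical jar name: the Scala suffix (_2.12 here) must match the Scala
# build of your Spark, and the GeoMesa version must match your install.
conf = geomesa_pyspark.configure(
    jars=['/usr/local/spark/jars/geomesa-accumulo-spark-runtime_2.12-3.2.0.jar'],
    packages=['geomesa_pyspark','pytz'],
    spark_home='/usr/local/spark/').\
    setAppName('MyTestApp')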
– Emilio Lahr-Vivaz
  • I installed Java 8, added hdfs to the Docker Compose stack, and set the related environment variables; the rest I think is OK. A new error has arisen: Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : java.net.ConnectException: Call From 2bc1a5523504/172.20.0.3 to 0.0.0.0:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused – Luigi Sep 29 '22 at 10:54
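
For reference, 0.0.0.0:8032 is the default YARN ResourceManager address, so this error typically means the client never found a yarn-site.xml naming the real ResourceManager host. A sketch of pointing the notebook kernel at the Hadoop configuration before creating any SparkContext, assuming a hypothetical /opt/hadoop/etc/hadoop directory:

import os

# Hypothetical path: the directory must contain core-site.xml and yarn-site.xml
# with the real ResourceManager address (0.0.0.0:8032 means none was configured).
os.environ['HADOOP_CONF_DIR'] = '/opt/hadoop/etc/hadoop'
os.environ['YARN_CONF_DIR'] = '/opt/hadoop/etc/hadoop'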