
Here is some context about my installation of the pyspark binary.

In my company, we use Cloudera Data Science Workbench (CDSW). When we create a session for a new project, I'm guessing it spins up an image built from a specific Dockerfile, and that this Dockerfile installs the CDH binaries and configuration.

Now I wish to use those binaries and configurations outside CDSW. I have a Kubernetes cluster where I deploy webapps, and I would like to use Spark in YARN mode so the webapps only need very small resources.

What I have done is to tar.gz all the binaries and config from /opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072 and /var/lib/cdsw/client-config/, then extract them in a container or in a WSL2 instance.

Instead of unpacking everything under /var/ or /opt/ like I should, I've put it under $HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/* and $HOME/etc/client-config/*. Why did I do this? Because I might want to use a mounted volume in Kubernetes and share the binaries between containers.

I've used sed to modify all the configuration files to adapt the paths (roughly like the sketch after this list):

  • spark-env.sh
  • topology.py
  • Any *.txt, *.sh, *.py
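
In essence, something like this (a simplified sketch of the idea, not the exact commands I ran):

NEW_PREFIX="$HOME"
find "$HOME/opt/cloudera/parcels" "$HOME/etc/client-config" \
  -type f \( -name "*.sh" -o -name "*.py" -o -name "*.txt" \) \
  -exec sed -i \
    -e "s|/opt/cloudera/parcels|$NEW_PREFIX/opt/cloudera/parcels|g" \
    -e "s|/var/lib/cdsw/client-config|$NEW_PREFIX/etc/client-config|g" {} +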

So I managed to run beeline, hadoop, hdfs and hbase by pointing them at the hadoop-conf folder. I can use pyspark, but in local mode only. What I really want is to use pyspark with YARN.

So I set a bunch of env variables to make this work:

export HADOOP_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export SPARK_CONF_DIR=$HOME/etc/client-config/spark-conf/yarn-conf
export JAVA_HOME=/usr/local
export BIN_DIR=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/bin
export PATH=$BIN_DIR:$JAVA_HOME/bin:$PATH
export PYSPARK_PYTHON=python3.6
export PYSPARK_DRIVER_PYTHON=python3.6
export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1

export SPARK_HOME=/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark
export PYSPARK_ARCHIVES_PATH=$(ZIPS=("$CDH_DIR"/lib/spark/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYSPARK_ARCHIVES_PATH
export SPARK_DIST_CLASSPATH=$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/hadoop/client/accessors-smart-1.2.jar:<ALL OTHER JARS FOR EVERY BINARIES>

Anyway, all of these paths exist and work. And since I've rewritten all the config files with sed, they resolve to the same paths as the exported ones.
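
For reference, a quick way to double-check that each export points at an existing directory (just a sketch):

for d in "$HADOOP_CONF_DIR" "$SPARK_CONF_DIR" "$JAVA_HOME" "$BIN_DIR" "$SPARK_HOME"; do
  if [ -d "$d" ]; then echo "OK       $d"; else echo "MISSING  $d"; fi
done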

I launch my pyspark binary like this:

pyspark --conf "spark.master=yarn" --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf --verbose

FYI, it is using pyspark 2.4.0, and I've installed Java(TM) SE Runtime Environment (build 1.8.0_131-b11), the same one I found on the CDSW instance. I added the keystore with the company's public certificate, and I also generated a keytab for the Kerberos auth. Both of them work, since I can use hdfs with HADOOP_CONF_DIR=$HOME/etc/client-config/hadoop-conf.

In verbose mode I can see all the details and configuration from Spark. When I compare it with the CDSW session, they are nearly identical (apart from the modified paths), for example:

Using properties file: /home/docker4sg/etc/client-config/spark-conf/spark-defaults.conf
Adding default property: spark.lineage.log.dir=/var/log/spark/lineage
Adding default property: spark.port.maxRetries=250
Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: spark.driver.log.persistToDfs.enabled=true
Adding default property: spark.yarn.jars=local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/jars/*,local:/home/docker4sg/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/hive/*
...

After a few seconds it fails to create a SparkSession:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-22 14:44:14 WARN  Client:760 - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
2022-02-22 14:44:14 ERROR SparkContext:94 - Error initializing SparkContext.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
Caused by: java.net.URISyntaxException: Expected scheme-specific part at index 12: pyspark.zip:
...
2022-02-22 14:44:15 WARN  YarnSchedulerBackend$YarnSchedulerEndpoint:69 - Attempted to request executors before the AM has registered!
2022-02-22 14:44:15 WARN  MetricsSystem:69 - Stopping a MetricsSystem that is not running
2022-02-22 14:44:15 WARN  SparkContext:69 - Another SparkContext is being constructed (or threw an exception in its constructor).  This may indicate an error, since only one SparkContext may be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58

From what I understand, it fails for a reason I'm not sure about and then tries to fall back into another mode, which fails too.
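
Since the URISyntaxException complains about a bare pyspark.zip: with no scheme, a first check is to look at what actually ends up in PYSPARK_ARCHIVES_PATH, entry by entry (a sketch; it assumes comma-separated entries that may carry a local: or file: prefix):

IFS=',' read -ra ENTRIES <<< "$PYSPARK_ARCHIVES_PATH"
for entry in "${ENTRIES[@]}"; do
  path="${entry#local:}"
  path="${path#file:}"
  if [ -e "$path" ]; then echo "OK       $entry"; else echo "MISSING  $entry"; fi
done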

In the configuration file spark-conf/yarn-conf/yarn-site.xml, it is specified that a ZooKeeper quorum is used:

  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>corporate.machine.node1.name.net:9999,corporate.machine.node2.name.net:9999,corporate.machine.node3.name.net:9999</value>
  </property>

Could it be that the YARN cluster does not accept traffic from a random IP (a Kubernetes IP or the personal IP of my computer)? My guess is that the IP I'm working from is not on the whitelist, but at the moment I cannot ask for my IP to be added. How can I know for sure I'm looking in the right direction?
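
A quick way to test that hypothesis from the container or WSL2 host is to check DNS resolution and TCP reachability of the nodes listed in yarn-site.xml (a sketch; it assumes nslookup and nc are available in the image, and only uses the ZooKeeper hosts and port quoted above):

for host in corporate.machine.node1.name.net corporate.machine.node2.name.net corporate.machine.node3.name.net; do
  nslookup "$host" > /dev/null 2>&1 && echo "$host: DNS OK" || echo "$host: DNS failure"
  nc -z -w 3 "$host" 9999 && echo "$host:9999 reachable" || echo "$host:9999 unreachable"
done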

Edit 1:

As said in the comments, the URI of pyspark.zip was wrong. I've modified my PYSPARK_ARCHIVES_PATH to point to the real location of pyspark.zip.

PYSPARK_ARCHIVES_PATH=local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/py4j-0.10.7-src.zip,local:$HOME/opt/cloudera/parcels/CDH-6.3.4-1.cdh6.3.4.p4484.8795072/lib/spark/python/lib/pyspark.zip

Now I get an UnknownHostException error:

org.apache.spark.SparkException: Uncaught exception: org.apache.spark.SparkException: Exception thrown in awaitResult
...
Caused by: java.io.IOException: Failed to connect to <HOSTNAME>:13250
...
Caused by: java.net.UnknownHostException: <HOSTNAME>
...
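
For what it's worth, in YARN client mode the application master and executors have to connect back to the driver, so one possible explanation (purely an assumption at this stage) is that the cluster cannot resolve the hostname of my container / WSL2 machine. Spark lets the driver advertise an explicit address, roughly like this (illustrative placeholder value):

pyspark --conf "spark.master=yarn" \
        --conf "spark.driver.host=<AN-IP-OR-FQDN-THE-CLUSTER-CAN-REACH>" \
        --properties-file $HOME/etc/client-config/spark-conf/spark-defaults.conf \
        --verbose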
  • You think too much. There's a Java `URISyntaxException` with an error message saying that some configuration property contains `pyspark.zip` or `pyspark.zip:` instead of e.g. `file:///tmp/pyspark.zip` or `hdfs:///user/dummy/pyspark.zip` or `https://service.name/path/spark.zip` -- a syntax error. Back to reality, find out how the containerized Cloudera Data SciFi Workbench manages that ZIP and how that translates in your local setup. – Samson Scharfrichter Feb 22 '22 at 18:40
  • @SamsonScharfrichter Indeed, the path of PYSPARK_ARCHIVES_PATH was wrong, pointing to a non-existing folder. I've hardcoded `PYSPARK_ARCHIVES_PATH` instead of dynamically trying to build it. Now I get `java.net.UnknownHostException: $HOSTNAME`. I'll update my post for proper formatting. – BeGreen Feb 24 '22 at 10:24
  • Now you have a networking error that has nothing to do with Spark (apart from any configuration files you have)... `ping` and `nslookup` / `dig` would be used to troubleshoot – OneCricketeer Feb 24 '22 at 17:16
  • @OneCricketeer What I don't understand is that it is my local hostname, not the hostname of the ZooKeeper/YARN nodes – BeGreen Feb 25 '22 at 18:20
  • Where is the error coming from? The executors need to return data back to the driver (your local machine), so that might be the cause. Have you tried using `--deploy-mode=cluster`? – OneCricketeer Feb 25 '22 at 18:25
  • The error comes from the command `pyspark --conf "spark.master=yarn"` in verbose mode with the specified conf file. I'll try that asap. I can tell that pyspark can access HDFS, since it's writing a hidden folder in its user directory. – BeGreen Feb 26 '22 at 10:44

0 Answers