
I have a Spark cluster setup with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM which would connect to the master, and allow me to execute simple Scala code. So, here is the command I run on my local VM:

bin/spark-shell --master spark://spark01:7077

The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3 - one for each worker). If I peek at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2 / 2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code:

sc.parallelize(1 to 100).count    

Unfortunately, the command doesn't work. The shell will just print the same warning endlessly:

INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Following my research into the issue, I have confirmed that the master URL I am using is identical to the one on the web UI. I can ping and ssh both ways (cluster to local VM, and vice-versa). Moreover, I have played with the executor-memory parameter (both increasing and decreasing the memory) to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.
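
For reference, the executor memory was adjusted on the shell's command line, along these lines (the values shown are only illustrative, not the exact ones I tried):

bin/spark-shell --master spark://spark01:7077 --executor-memory 512m
bin/spark-shell --master spark://spark01:7077 --executor-memory 1g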

TL;DR Is it possible to run an Apache Spark shell remotely (and inherently submit applications remotely)? If so, what am I missing?

EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:

ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...

Spark is installed in a different directory on my local VM than on the cluster. The path the worker is attempting to find is the one on my local VM. Is there a way for me to specify this path? Or must they be identical everywhere?

For the moment, I adjusted my directories to circumvent this error. Now, my Spark Shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers have the same error:

ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark02:7078] -> [akka.tcp://sparkExecutor@spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor@spark02:53633]] 
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark02:53633] 
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$annon2: Connection refused: spark02/192.168.64.2:53633

As suspected, I am running into network issues. What should I look at now?

Nicolas
  • Can you please try the following two things? 1. Try connecting to the master from the node on which the master is running. 2. Try replacing host names with IPs "everywhere". – Soumya Simanta Nov 02 '14 at 00:12
  • You can connect to a Spark cluster from a remote machine. Spark shell is just another Scala program that is running on the cluster. – Soumya Simanta Nov 02 '14 at 00:14
  • Yes, this is possible and should work. I suspect network issues. I'm not sure off the top of my head, but I think the workers will try to connect to your local machine on some port. From the symptoms I would guess that doesn't work out. Maybe you can find more information in the worker logs! – Daniel Darabos Nov 02 '14 at 15:37
  • You should also check for network issues. I know of two kinds. First, DNS problems: both forward and reverse lookup should work for every IP and hostname of the master, driver, and workers. Second, multiple IP addresses on the driver or master: check the logs to find which IP address the master and driver chose; that address is probably not reachable from the workers' network. – 1esha Nov 03 '14 at 02:42
  • Everything works well if I'm launching the shell from one of the workers, or the master itself. I only run into this issue from my local VM. I also tried using IPs "everywhere" previously. Unfortunately, it did not do anything. I understand Spark shell is another Scala program. What I am trying to do is to submit my own Spark application. I am using Spark shell here because I want to take my programming / code out of the equation in order to isolate the issue. – Nicolas Nov 03 '14 at 15:54
  • You should check that the master machine can resolve your IP address/host. If not, Akka will not be able to establish the communication properly. – Daniel H. Nov 21 '14 at 12:20
  • Thanks for the tip. As I mentioned previously, I am able to ping & ssh to and from the master, client and workers. I tried playing with /etc/hosts, but I am still stuck with the `ERROR akka.remote.EndpointWriter: AssociationError ... Connection refused` – Nicolas Nov 24 '14 at 13:57

2 Answers


I solved this problem between my Spark client and my Spark cluster.

Check your network: client A must be able to ping the cluster and vice versa. Then add the following two lines of configuration to spark-env.sh on client A.

First:

export SPARK_MASTER_IP=172.100.102.156  
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar

Second:

Test your Spark shell in cluster mode.
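
A minimal sketch of that test from client A, assuming the master IP above and the default standalone master port 7077 (adjust both to your setup):

# launch the shell from client A against the standalone master
bin/spark-shell --master spark://172.100.102.156:7077
# then run a small job, e.g. sc.parallelize(1 to 100).count, as in the question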

Rocketeer
  • The second recommendation doesn't really make sense: Running `spark-shell` with `--deploy-mode cluster` results in `Error: Cluster deploy mode is not applicable to Spark shells` and the question specifically addresses running a remote shell. – bluenote10 Aug 14 '16 at 10:01

This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):

  • actual resource shortage
  • broken communication between master and workers
  • broken communication between master/workers and driver

The easiest way to exclude the first two possibilities is to run a test with a Spark shell running directly on the master. If this works, communication within the cluster itself is fine, and the problem is caused by the communication to the driver host. To further analyze the problem, it helps to look into the worker logs, which contain entries like

16/08/14 09:21:52 INFO ExecutorRunner: Launch command: 
    "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" 
    ... 
    "--driver-url" "spark://CoarseGrainedScheduler@192.168.1.228:37752"  
    ...

and test whether the worker can establish a connection to the driver's IP/port. Apart from general firewall / port forwarding issues, it might be possible that the driver is binding to the wrong network interface. In this case you can export SPARK_LOCAL_IP on the driver before starting the Spark shell in order to bind to a different interface.
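
As a sketch of that check, assuming the driver address/port shown in the launch command above (192.168.1.228:37752), that nc (netcat) is available on the workers, and the master URL from the question (replace all of these with your own values):

# on a worker: verify the driver's scheduler address/port is reachable
nc -zv 192.168.1.228 37752

# on the driver host: bind Spark to an interface the workers can reach,
# then start the shell again
export SPARK_LOCAL_IP=192.168.1.228
bin/spark-shell --master spark://spark01:7077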


bluenote10