
I set up Spark on a single EC2 machine and, when I am connected to it, I can use Spark without any issue, either from Jupyter or with spark-submit. Unfortunately, though, I am not able to run spark-submit via ssh.

So, to recap:

  • This works:

      ubuntu@ip-198-43-52-121:~$ spark-submit job.py
    
  • This does not work:

      ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"
    

Initially, I kept getting the following error message:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'

After reading many articles and posts about this issue, I thought the problem was that some environment variables had not been set properly, so I added the following lines to the machine's .bashrc file:

export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7  # (this is where I unzipped the Spark distribution)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3

(Since the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed.)

After all this, if I try to submit the Spark job via ssh, I now get a different error message:

"command spark-submit not found".

Since it looks like the remote shell ignores all the environment variables when commands are sent via SSH, I decided to source the machine's .bashrc file before running the Spark job. As I was not sure of the most appropriate way to send multiple commands via SSH, I tried all of the following:

    ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.py"

    ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
    source .bashrc
    spark-submit job.py
    HERE

    ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
    source .bashrc
    spark-submit job.py
    HERE

    (ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.py")

All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.

I have also tried providing the full path to spark-submit, running the following line:

    ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"

In this case, too, I once again get the original error message:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'

How can I tell Spark which Python to use if SSH seems to ignore all the environment variables, no matter where I set them?
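
For instance, I do not even know whether setting the variable inline on the remote command, along these lines, would be a reasonable approach (the interpreter path /usr/bin/python3 is just a guess on my part):

    ssh -i file.pem ubuntu@blabla.compute.amazon.com "PYSPARK_PYTHON=/usr/bin/python3 /home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"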

It's worth mentioning that I got into coding and data a little over a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.

Thanks a lot in advance :)

1 Answer

The problem was indeed with the way I was expecting the remote shell to work (my expectation was wrong).

My issue was solved by:

  1. Setting my environment variables in .profile instead of .bashrc
  2. Providing the full path to the Python interpreter

Now I can launch spark jobs via ssh.
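
For reference, this is roughly what the working setup looks like. The paths are the ones from my question, and the exact interpreter location (/usr/bin/python3 here) is an assumption; adapt both to your own machine:

    # ~/.profile (instead of ~/.bashrc)
    export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7
    export PATH=$SPARK_HOME/bin:$PATH
    # full path to the interpreter rather than the bare name "python3"
    export PYSPARK_PYTHON=/usr/bin/python3

and then, from my local machine:

    ssh -i file.pem ubuntu@blabla.compute.amazon.com "spark-submit job.py"

(If spark-submit is still not found by the remote shell, using its full path, /home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit, works as well.)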

I found the solution in the answer @VinkoVrsalovic gave to this post:

Why does an SSH remote command get fewer environment variables then when run manually?

Cheers
