My environment: I'm using Hortonworks HDP 2.4 with Spark 1.6.1 on a small AWS EC2 cluster of four g2.2xlarge instances running Ubuntu 14.04. Each instance has CUDA 7.5, Anaconda Python 3.5, and PyCUDA 2016.1.1.

In /etc/bash.bashrc I've set:

export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=$PATH:/usr/local/cuda/bin

On all four machines I can access nvcc from the command line as the ubuntu user, the root user, and the yarn user.

My problem: I have a Python/PyCUDA project that I've adapted to run on Spark. It runs great on my local Spark installation on my Mac, but when I run it on AWS I get:

FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'

Since it runs on my Mac in local mode, my guess is that this is a configuration issue with CUDA/PyCUDA in the worker processes, but I'm stumped as to what it could be.
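
To illustrate, here is a minimal diagnostic sketch (assuming a live SparkContext named sc) for checking what PATH the executor processes actually see:

import os
import shutil

def probe(_):
    # Report PATH and the resolved nvcc location from inside the executor.
    yield (os.environ.get("PATH"), shutil.which("nvcc"))

# A handful of partitions is enough for a spot check across the workers.
print(sc.parallelize(range(8), 8).mapPartitions(probe).collect())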

Any ideas?

Edit: Below is a stack trace from one of the failing jobs:

16/11/10 22:34:54 INFO ExecutorAllocationManager: Requesting 13 new executors because tasks are backlogged (new desired total will be 17)
16/11/10 22:34:57 INFO TaskSetManager: Starting task 16.0 in stage 2.0 (TID 34, ip-172-31-26-35.ec2.internal, partition 16,RACK_LOCAL, 2148 bytes)
16/11/10 22:34:57 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on ip-172-31-26-35.ec2.internal:54657 (size: 32.2 KB, free: 511.1 MB)
16/11/10 22:35:03 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 18, ip-172-31-26-35.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 46, in call_capture_output
    popen = Popen(cmdline, cwd=cwd, stdin=PIPE, stdout=PIPE, stderr=PIPE)
  File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 947, in __init__
    restore_signals, start_new_session)
  File "/home/ubuntu/anaconda3/lib/python3.5/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'nvcc'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/hadoop/yarn/local/usercache/ubuntu/appcache/application_1478814770538_0004/container_e40_1478814770538_0004_01_000009/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2346, in pipeline_func
  File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 317, in func
  File "/home/ubuntu/pycuda-euler/src/cli_spark_gpu.py", line 36, in <lambda>
    hail_mary = data.mapPartitions(lambda x: ec.assemble2(k, buffer=x, readLength = dataLength,readCount=dataCount)).saveAsTextFile('hdfs://172.31.26.32/genome/sra_output')
  File "./eulercuda.zip/eulercuda/eulercuda.py", line 499, in assemble2
    lmerLength, evList, eeList, levEdgeList, entEdgeList, readCount)
  File "./eulercuda.zip/eulercuda/eulercuda.py", line 238, in constructDebruijnGraph
    lmerCount, h_kmerKeys, h_kmerValues, kmerCount, numReads)
  File "./eulercuda.zip/eulercuda/eulercuda.py", line 121, in readLmersKmersCuda
    d_lmers = enc.encode_lmer_device(buffer, partitionReadCount, d_lmers, readLength, lmerLength)
  File "./eulercuda.zip/eulercuda/pyencode.py", line 78, in encode_lmer_device
    """)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 265, in __init__
    arch, code, cache_dir, include_dirs)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 255, in compile
    return compile_plain(source, options, keep, nvcc, cache_dir, target)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 78, in compile_plain
    checksum.update(preprocess_source(source, options, nvcc).encode("utf-8"))
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/compiler.py", line 50, in preprocess_source
    result, stdout, stderr = call_capture_output(cmdline, error_on_nonzero=False)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 197, in call_capture_output
    return forker[0].call_capture_output(cmdline, cwd, error_on_nonzero)
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pytools/prefork.py", line 54, in call_capture_output
    % ( " ".join(cmdline), e))
pytools.prefork.ExecError: error invoking 'nvcc --preprocess -arch sm_30 -I/home/ubuntu/anaconda3/lib/python3.5/site-packages/pycuda/cuda /tmp/tmpkpqwoaxf.cu --compiler-options -P': [Errno 2] No such file or directory: 'nvcc'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2 Answers

To close the loop on this, I finally worked my way through the problem.

Note: I know this is neither a good nor a permanent answer for most people; however, in my case I'm running proof-of-concept code for my dissertation, and as soon as I have some final results I'm decommissioning the servers. I doubt this answer will be suitable or appropriate for most users.

I ended up hardcoding the full path to nvcc into compile_plain() in PyCUDA's compiler.py file.

Partial listing:

def compile_plain(source, options, keep, nvcc, cache_dir, target="cubin"):
    from os.path import join

    assert target in ["cubin", "ptx", "fatbin"]
    # Workaround: force an absolute path so the executor finds nvcc
    # even though /usr/local/cuda/bin is missing from its PATH.
    nvcc = '/usr/local/cuda/bin/' + nvcc
    if cache_dir:
        checksum = _new_md5()
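
A less invasive variant of the same workaround (an untested sketch; with_cuda_path is my own helper name) is to prepend the CUDA bin directory to PATH inside the worker before PyCUDA shells out to nvcc, instead of patching site-packages:

import os

def with_cuda_path(f):
    """Wrap a partition function so nvcc is on PATH in the executor process."""
    def wrapped(partition):
        # Prepend rather than replace, in case PATH is already partly set.
        os.environ["PATH"] = "/usr/local/cuda/bin:" + os.environ.get("PATH", "")
        return f(partition)
    return wrapped

# Usage: data.mapPartitions(with_cuda_path(my_gpu_func))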

Hopefully this points someone else in the proper direction.


The error means that nvcc is not on the PATH of the process that runs the code.

Amazon ECS Container Agent Configuration - Amazon EC2 Container Service has instructions on how to set up environment variables for the cluster.

For the same in Hadoop, there's Configuring Environment of Hadoop Daemons – Hadoop Cluster Setup.
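
For Spark itself, the executor environment can also be set per application via spark.executorEnv.* (a sketch; the PATH value is an assumption based on the question's cluster layout):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("pycuda-euler")
        # spark.executorEnv.<name> sets an environment variable in every
        # executor process, independent of the shell rc files.
        .set("spark.executorEnv.CUDA_HOME", "/usr/local/cuda")
        .set("spark.executorEnv.PATH",
             "/usr/local/cuda/bin:/usr/bin:/bin"))
sc = SparkContext(conf=conf)

On YARN there is also spark.yarn.appMasterEnv.* for setting the driver-side environment.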

  • Yep, I know all that and spent almost 2 weeks trying to find the "right" way to send that path to the Spark workers. – zenlc2000 Nov 14 '16 at 18:55
  • @zenlc2000 So, what's the problem with that? Since the environment is inherited, you only need to set it for some "root" process, which is just the case the link covers, AFAICS. Unless you run your worker in some weird way and/or pyspark resets the environment for some weird reason - [the latter doesn't seem likely](http://stackoverflow.com/questions/35576621/setting-environment-variables-from-python-code-for-spark). – ivan_pozdeev Nov 14 '16 at 19:00
  • The processes get run by the yarn user on the Hadoop cluster. As stated above, when I su to yarn I can run nvcc from the command line but when the job is run by the framework, the workers do not have the path. Ultimately I found a workaround that is "good enough." If I continue this line of research I'll find the "right" way to do it, when my degree and graduation aren't on the line. – zenlc2000 Nov 14 '16 at 19:50
  • BTW - I do appreciate your attention and attempts at answering the question. Thank you. – zenlc2000 Nov 14 '16 at 19:52
  • @zenlc2000 Now it's not AWS EC2 but Hadoop? Well, it's basically the same there (updated). – ivan_pozdeev Nov 15 '16 at 00:30
  • Hadoop is a big data processing framework that I have running on AWS EC2. – zenlc2000 Nov 15 '16 at 03:25