
I am trying to set up a remote conda interpreter in PyCharm for Anaconda 2019.1.2 Pro on macOS Mojave, and can't get it to work. My existing remote conda environment (conda v4.5.12) runs on an Ubuntu 16 EC2 machine, instantiated from Amazon's Deep Learning AMI.

I tried setting up an SSH interpreter and pointed it at /home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python, which is the interpreter of my conda environment. I then ran a simple TensorFlow GPU test on this interpreter and got the following output, which strongly suggests the environment was not activated (the server's IP address and company name were purposely obfuscated):

ssh://ubuntu@xx.xx.xx.xx:22/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python -u /home/ubuntu/company/DeepLearning_copy/apps/test_gpu.py
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/company/DeepLearning_copy/apps/test_gpu.py", line 1, in <module>
    import tensorflow as tf
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

Process finished with exit code 1

The code runs perfectly when SSHing into the server, running conda activate tensorflow_p36 and then python gpu_test.py.
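
For reference, this is the exact manual sequence that works (the IP placeholder matches the obfuscated address above):

$ ssh ubuntu@xx.xx.xx.xx
$ conda activate tensorflow_p36
$ python gpu_test.py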

I would appreciate any workaround that allows remote debugging with an existing remote conda environment. In the meantime I've opened an issue with JetBrains and with the Anaconda community group.

Edit: please see a potential workaround on the JetBrains issue page.
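
For readers who can't reach the issue page: workarounds of this kind typically point PyCharm at a small wrapper script that activates the environment before handing control to Python. A minimal sketch, assuming that is the shape of the workaround (the script name is hypothetical, and it is not necessarily the exact script from the issue; whether your PyCharm version accepts a wrapper as the interpreter path may vary):

#!/bin/bash
# /home/ubuntu/pycharm_python_wrapper.sh (hypothetical name)
# Activate the conda env, then exec its python with whatever arguments PyCharm passes in.
source /home/ubuntu/anaconda3/bin/activate tensorflow_p36
exec python "$@"

Make it executable with chmod +x and set it as the remote interpreter path instead of the env's bin/python.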

Assif
  • Please provide context on what you have tried, what doesn't work. – Diane M May 12 '19 at 10:42
  • "strongly suggests the environment wasn't activated": the traceback does suggest the use of packages from the virtualenv, i.e. the error is raised from the venv's TensorFlow. It seems the error lies within TensorFlow and not with your Python install. You have probably run into a TensorFlow / CUDA compatibility issue, like [some other users](https://github.com/tensorflow/tensorflow/issues/26209) – Diane M May 12 '19 at 11:55
  • Thanks @ArthurHavlicek. I believe this behaviour is consistent with setting up the PATH like [this](https://stackoverflow.com/a/34376031/11487943), but without running `source activate tensorflow_p36`. – Assif May 12 '19 at 12:22
  • I have yet to verify this, but I suspect (since you're using the AWS AMI) that it's because AWS compiles an optimized version of TensorFlow, which is installed when you first `conda activate` your environment (e.g. `conda activate tensorflow_p36`). Perhaps you could try re-installing tensorflow-gpu from pip and trying again? – tauculator May 30 '19 at 10:22

4 Answers


What you can do is:

  1. Go to "Run/Debug Configurations"
  2. Under "Environment" you will see "Environment variables"
  3. Set the proper path to CUDA there. In my case it was: "LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64"

I am also disappointed that this is not done by default by the JetBrains team.
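
If you're not sure which directory to use, you can locate libcublas on the server first; a quick sketch (the traceback in the question asks for libcublas.so.10.0, so a cuda-10.0 path is more likely there than 9.0):

$ ls -d /usr/local/cuda*                      # list installed CUDA toolkits
$ ls /usr/local/cuda-10.0/lib64/libcublas*    # confirm the missing library exists here

Then set LD_LIBRARY_PATH to that lib64 directory in the Run/Debug configuration.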


Instead of specifying the path to "python", it works for me to specify the path to "activate" like this:

ssh [host] "source ~/anaconda3/bin/activate [name of conda env] ; cd [pick a dir] ; [command]"

For [command] try "conda env list" to see which environment is activated. Or you can do "python foo.py".

You may have to adjust the path "~/anaconda3/bin/activate".
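
Filled in with the names from the question (host placeholder kept obfuscated), that would look like:

$ ssh ubuntu@xx.xx.xx.xx "source ~/anaconda3/bin/activate tensorflow_p36 ; cd ~/company/DeepLearning_copy/apps ; python test_gpu.py"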

mepster

OP, it could be that something done to your environment messed up the CUDA installation, as several others have mentioned.

I just provisioned a new Deep Learning AMI instance on AWS; is this a viable option for you?

Anyway, I performed the following steps after SSHing into the (newly provisioned) server:

Initial activation

$ conda activate tensorflow_p36
WARNING: First activation might take some time (1+ min).
Installing TensorFlow optimized for your Amazon EC2 instance......
Env where framework will be re-installed: tensorflow_p36
Instance p2.xlarge is identified as a GPU instance, removing tensorflow-serving-cpu
Installation complete.

Scenario 1: Running the GPU test from within the tensorflow_p36 conda environment

Do this to make sure that TensorFlow is working fine, as per the OP's scenario.

$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> # Creates a graph.
... a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> # Creates a session with log_device_placement set to True.
... sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
>>> # Runs the op.
... print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]

Scenario 2: Deactivating the environment and calling the same Python executable as if from within the environment

This should be equivalent to configuring the remote interpreter to use that particular Python interpreter. Notice that there's a lot more output after sess = tf.Session(...) than in the case above, but everything still runs OK.

$ conda deactivate
$ /home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> # Creates a graph.
... a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> # Creates a session with log_device_placement set to True.
... sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-05-31 07:14:23.840474: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-31 07:14:23.841300: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ec160ca020 executing computations on platform CUDA. Devices:
2019-05-31 07:14:23.841334: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-05-31 07:14:23.843647: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2019-05-31 07:14:23.843845: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ec16131af0 executing computations on platform Host. Devices:
2019-05-31 07:14:23.843870: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-31 07:14:23.844965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2019-05-31 07:14:23.844992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-31 07:14:23.845991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-31 07:14:23.846013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-31 07:14:23.846020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-31 07:14:23.846577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2019-05-31 07:14:23.847176: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7

>>> # Runs the op.
... print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478310: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478383: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478413: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]

Scenario 3: Using that conda environment's interpreter as a remote interpreter in JetBrains PyCharm, within the PyCharm Python Console

Note that the output is basically the same as in Scenario 2 above: the TensorFlow GPU test works fine and doesn't throw any errors.

ssh://ubuntu@XX.XX.XX.XX:22/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python -u /home/ubuntu/.pycharm_helpers/pydev/pydevconsole.py --mode=server

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 6.4.0
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
2019-05-31 07:17:03.883169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-31 07:17:03.883577: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be28eef280 executing computations on platform CUDA. Devices:
2019-05-31 07:17:03.883609: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-05-31 07:17:03.886035: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2019-05-31 07:17:03.886752: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be28f56d50 executing computations on platform Host. Devices:
2019-05-31 07:17:03.886777: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-31 07:17:03.886983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 508.38MiB
2019-05-31 07:17:03.887009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-31 07:17:03.887658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-31 07:17:03.887681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-31 07:17:03.887697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-31 07:17:03.887881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 283 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2019-05-31 07:17:03.889133: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890673: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890718: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890750: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]
tauculator
  • Thanks! Unfortunately, when I deactivate and try to run using that env, it doesn't work (I get the same error as in the question). Therefore it runs only when the env is activated, and doesn't run when it's not. – Assif Jun 05 '19 at 10:11
  • @Assif - does it work if you provision a new instance? On another environment I ran into the same issue that you do, but on a newly provisioned instance everything runs ok. Did you change the instance type after provisioning, by any chance? e.g. `p2.xlarge -> non-GPU instance`, I suspect that may mess with the CUDA installation... – tauculator Jun 06 '19 at 01:11
  • Thank you @user5042861, I did not change the instance type after provisioning. The problem still persists when I provision a new instance. – Assif Mar 24 '20 at 23:19

I think it's a CUDA error; CUDA is not configured properly. Are you using tensorflow-gpu?
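
A quick way to check the CUDA setup on the server (standard diagnostics, not specific to this AMI):

$ nvidia-smi                      # confirm the driver sees the GPU
$ ldconfig -p | grep libcublas    # show which libcublas versions the linker can find
$ echo $LD_LIBRARY_PATH           # a "cannot open shared object file" error usually means this lacks the CUDA lib directory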

Dhruv Rajkotia
  • Thanks! I am certain the environment is setup well for 2 reasons: (1) It was setup by AWS via the [DL AMI](https://docs.aws.amazon.com/dlami/latest/devguide/overview-conda.html) (2) The code runs perfectly when SSHing into the server, running `conda activate tensorflow_p36` and then `python gpu_test.py` – Assif May 12 '19 at 12:52
  • Indeed, I am using tensorflow-gpu – Assif May 12 '19 at 12:56
  • So I'm sure about this error: it's a CUDA error. Please reconfigure it. – Dhruv Rajkotia May 12 '19 at 12:58
  • This occurs only when I run the script via the PyCharm SSH interpreter. If I SSH into the server, activate the environment, and then run the script, it runs perfectly. Therefore I am certain the environment is configured properly, **but** not activated by PyCharm upon remote script execution. – Assif May 12 '19 at 13:02