44

I'm working on some python code that extracts some image data from an ECW file using GDAL (http://www.gdal.org/) and its python bindings. GDAL was built from source to have ECW support.

The program is run on a cluster server that I ssh into. I have tested the program through the ssh terminal and it runs fine. However, I would now like to submit a job to the cluster using qsub, but it reports the following:

Traceback (most recent call last):
  File "./gdal-test.py", line 5, in <module>
    from osgeo import gdal
  File "/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/__init__.py", line 21, in <module>
    _gdal = swig_import_helper()
  File "/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/__init__.py", line 17, in swig_import_helper
    _mod = imp.load_module('_gdal', fp, pathname, description)
ImportError: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: undefined symbol: H5Eset_auto2

I did a bit more digging and tried using LD_DEBUG=symbols to try and work out where the difference was, but that's about as far as my knowledge/understanding has got me.

For reference, here's what happens with LD_DEBUG=symbols and running the code in the ssh terminal (piping through grep H5Eset_auto2 to reduce some of the output):

Symbol debug output for code running in ssh terminal:

 11359: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 11359: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]

Symbol debug output for code submitted using qsub:

 16915: symbol=H5Eset_auto2;  lookup in file=/usr/bin/python26 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/_gdal.so [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libgcc_s.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libjpeg.so.62 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpng12.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libpq.so.4 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libcurl.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libgssapi_krb5.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libkrb5.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libk5crypto.so.3 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcom_err.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libidn.so.11 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libssl.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcrypto.so.6 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSEcw.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSEcwC.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSCnet.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libNCSUtil.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/librt.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libxml2.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/mnt/aeropix/prgs/.local/lib/libz.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libcrypt.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libresolv.so.2 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libnsl.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/usr/lib64/libkrb5support.so.0 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libkeyutils.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libselinux.so.1 [0]
 16915: symbol=H5Eset_auto2;  lookup in file=/lib64/libsepol.so.1 [0]
 16915: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: error: symbol lookup error: undefined symbol: H5Eset_auto2 (fatal)
ImportError: /mnt/aeropix/prgs/.local/lib/libgdal.so.1: undefined symbol: H5Eset_auto2

I guess I'm not sure why it seems to stop looking in libgdal.so.1 when submitted using qsub, when it continues to look when just run in the terminal. I also note that the qsub job is able to correctly locate libhdf5.so.7 (which is where it should find H5Eset_auto2 ) as it can find a different symbol, H5Eprint :

 16915: symbol=H5Eprint;  lookup in file=/usr/lib64/libpython2.6.so.1.0 [0]
 16915: symbol=H5Eprint;  lookup in file=/mnt/aeropix/prgs/.local/lib/libgdal.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/usr/lib64/libstdc++.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libm.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libgcc_s.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libpthread.so.0 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libc.so.6 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libdl.so.2 [0]
 16915: symbol=H5Eprint;  lookup in file=/lib64/libutil.so.1 [0]
 16915: symbol=H5Eprint;  lookup in file=/mnt/aeropix/prgs/.local/lib/libhdf5.so.7 [0]

Any pointers on this would be incredibly useful at this stage (I hope that's enough information - I'm more than happy to provide more information, I'm just not sure what else might be useful at this stage).

EDIT:

It seems that the contents of /usr/bin are different for jobs submitted using qsub (specifically libtool is missing). This is being investigated.

agnussmcferguss
  • 485
  • 1
  • 4
  • 10
  • 1
    One thing that might help is to look at the environment variable `LD_LIBRARY_PATH` when you're running interactively and when it's running in the queue. Look for any discrepancies related to HDF5. – Rufflewind Nov 21 '14 at 05:40
  • I don't think so, I think there might be something more going on than that. I know the qsub job can find `libhdf5.so.7` as it correctly locates symbol `H5Eprint` before getting stuck on `H5Eset_auto2` As a minimum working example, the single line `from osgeo import gdal` works when I run it through the ssh terminal, but fails when submitted using qsub. Also, I have set `export LD_LIBRARY_PATH=/mnt/aeropix/prgs/.local/lib` (which is where `libhdf5.so.7` is) – agnussmcferguss Nov 21 '14 at 05:45
  • I don't have an answer yet but if you look at the very first output (interactive), notice that the lines consists of two identical groups, one repeated after the other. However, in the second output (in the queue system) the first group of lines is the same but the second group of lines looks very different. – Rufflewind Nov 21 '14 at 05:47
  • 1
    Another thing that might help you debug: capture *all* the environment variables in your job script (`env >$HOME/myenvs`), then reproduce that same environment (`env $(xargs <$HOME/myenvs) bash`) and see if it fails in the exact same way. If it does, examine and look for any suspicious discrepancies from your interactive environment. – Rufflewind Nov 21 '14 at 05:51
  • That's okay, thanks for looking into it anyway! I hadn't actually noticed they were two identical groups (or not in the second case) until now. I did notice that the second one, at least in the first half, looks at `libhdf5.so.7` but then continues onward? – agnussmcferguss Nov 21 '14 at 05:53
  • 1
    It suggests that there are two separate "processes": first one works just fine and appears to be using the identical environment, whereas the second one does not and appears to be "doing its own thing" (and thus failing to find HDF5). – Rufflewind Nov 21 '14 at 05:56
  • Okay... is that normal? also, I tried replicating the cluster environment variables but it didn't throw an error in the ssh terminal – agnussmcferguss Nov 21 '14 at 06:04
  • That's really odd. So I looked at the code of the GDAL bindings and it appears to be trying to load the libraries twice. Can you try running this interactively? Also try it in the queue to see if you get the same results. `f, p, d = imp.find_module('_gdal', ['/home/h3/ctargett/.local/lib/python2.6/site-packages/GDAL-1.11.1-py2.6-linux-x86_64.egg/osgeo/']); print(f, p, d); imp.load_module('_gdal', f, p, d)` (Replace the semicolons with newlines if you wish.) The code essentially tries to replicate what is done by the GDAL bindings, and hopefully will help narrow down the problem. – Rufflewind Nov 21 '14 at 07:00
  • (Note: prepend `import imp;`) – Rufflewind Nov 21 '14 at 07:06
  • Sorry for the long response time, I was out of the office for the weekend. When I run that that interactively I get: https://drive.google.com/file/d/0By6cKos4YGCGWEtQNWM1ck9veVk/view?usp=sharing However, when run in the queue I get: https://drive.google.com/file/d/0By6cKos4YGCGWHlndERYZ1hlaDg/view?usp=sharing – agnussmcferguss Nov 23 '14 at 23:03
  • That's really odd. Sorry, I don't have any good guesses right now, but you can try checking other things and see if they make any difference: Are jobs run on a different user? Are the Python executables different? (versions?) Are the shells different? (e.g. Bash) Is there any clue from the detailed output of the dynamic loader? (`LD_DEBUG`) Does loading `gdal` work if you execute it non-interactively? (E.g. `ssh user@cluster 'python -c "import osgeo.gdal"'`) – Rufflewind Nov 24 '14 at 03:08
  • It is indeed quite odd. The jobs are run with the same user name and the python executables are both the same (2.6.8). The shells might be different - they're both bash, but the interactive one seems to use /bin/bash whereas the queue uses /usr/local/bin/bash. However, I've tried using both and they both work fine interactively and both fail when used with the queue. I tried requesting an interactive session from the queue (using `qsub -I`), and that results in the same error (`undefined symbol: H5Eset_auto2`) as submitting the job. – agnussmcferguss Nov 24 '14 at 05:26
  • Using `ssh user@cluster 'export LD_LIBRARY_PATH=/mnt/aeropix/prgs/.local/lib ; python26 -c "from osgeo import gdal"'` works fine though – agnussmcferguss Nov 24 '14 at 05:36
  • Can you try putting the `LD_LIBRARY_PATH` variable export in your `.bashrc` and see if it affects the job? On another note, would you be able to paste the (full, ungrepped) logs from the dynamic loader? (use `LD_DEBUG=all`) – Rufflewind Nov 24 '14 at 05:47
  • I put `LD_LIBRARY_PATH` in my `.bashrc` but there was no change. Here are the links to the full debug logs (they're quite large). ssh output: https://drive.google.com/file/d/0By6cKos4YGCGYjl3Z2twaGVTYVE/view?usp=sharing qsub output: https://drive.google.com/file/d/0By6cKos4YGCGT2t5ZGpycGYzaU0/view?usp=sharing – agnussmcferguss Nov 24 '14 at 06:08
  • 1
    Can you also check if `libtool` is of the same version? – Rufflewind Nov 24 '14 at 06:43
  • 1
    Can you run this both interactively and in the queue and look at the output? They ought to produce the exact same output - I can't imagine how they wouldn't be but I'm out of ideas :\ `nm -D /mnt/aeropix/prgs/.local/lib/libhdf5.so.7 | grep H5Eset_auto2` – Rufflewind Nov 24 '14 at 07:10
  • The output of `nm -D` using ssh gives `0000003126a915b0 T H5Eset_auto2`, and in the queue gives nothing at all. – agnussmcferguss Nov 24 '14 at 22:59
  • Well this is bizarre, when I do `libtool --version` in the ssh terminal I get that it's version `1.5.22 (1.1220.2.365)`, while in the queue I get `libtool: command not found`. I tried doing `whereis libtool` and got `/usr/bin/libtool` for the ssh terminal, and nothing found when in the queue. Doing `ls /usr/bin` I find that they appear to be different for the ssh terminal and the queue - I am investigating this further. – agnussmcferguss Nov 24 '14 at 23:05
  • Just be to sure, do an md5sum and see if the queue sees the exact same library file (libhdf5.so.7) as in interactive mode. – Rufflewind Nov 24 '14 at 23:30
  • md5sum in interactive mode: 1560c4cf2f8bd0c5e64879f423f44bdf md5sum in queue: e01aad433d9d6ff2f8e48b07280d6442 they do appear to be different, although I've no idea how. – agnussmcferguss Nov 24 '14 at 23:36
  • 1
    Interesting ... can you create a copy of `libhdf5.so.7` somewhere else and see if the queue sees the same file? (If it does, try putting that new location in `LD_LIBRARY_PATH` instead.) – Rufflewind Nov 24 '14 at 23:59
  • Solution found! Turns out `libhdf5.so.7` was symlinked to a location that was apparently different in the queue. I modified the symlink and now everything seems to be working correctly. Would you like to answer the question? You did most of the work so I'd say you earned it. – agnussmcferguss Nov 25 '14 at 00:11
  • That's good to hear! You've done your part too investigating this. (In hindsight this probably should've been the first thing to check.) – Rufflewind Nov 25 '14 at 00:12
  • Ah yes, hindsight - I guess the next clue really should've been when we found that `/usr/bin` was different for the queue. For further reference, after being symlinked all over the place, `libhdf5.so.7` ended up pointing to `/usr/lib64/libhdf5.so.7.0.1`. Thanks for your help, you're a champ. – agnussmcferguss Nov 25 '14 at 00:20
  • Seems that gdal has mostly been a red herring here. One thing that puzzles me: how did that symlink get created? It seems like a rather bad idea! – Rufflewind Nov 25 '14 at 00:24
  • Honestly, I have no idea any more. GDAL is painful at best to get working when you need to build it from scratch (which I had to, as I needed ECW support). It actually took me weeks to even get it to build as I ran into a plethora of problems including gcc 4.1.2 not playing nicely with the first version of the ECW library we got our hands on (and the associated 'hoop jumping' you need to do in order to request changes to installed software on the cluster). In the end we got an older ECW library and just built and installed it in a user area. – agnussmcferguss Nov 25 '14 at 00:39

1 Answers1

38

After two dozens of comments to understand the situation, it was found that the libhdf5.so.7 was actually a symlink (with several levels of indirection) to a file that was not shared between the queued processes and the interactive processes. This means even though the symlink itself lies on a shared filesystem, the contents of the file do not and as a result the process was seeing different versions of the library.

For future reference: other than checking LD_LIBRARY_PATH, it's always a good idea to check a library with nm -D to see if the symbols actually exist. In this case it was found that they do exist in interactive mode but not when run in the queue. A quick md5sum revealed that the files were actually different.

Rufflewind
  • 8,545
  • 2
  • 35
  • 55
  • 7
    Also note, when following symlinks, remember to keep following them until you've reached the end. – agnussmcferguss Nov 25 '14 at 00:23
  • 2
    The "For future reference" bit was great advice that helped me solve my problem. Amazing how specific this question is, yet given all the upvotes, how ubiquitously it applies. – mbroshi Apr 27 '18 at 15:58
  • 2
    e.g. nm -D /usr/lib/x86_64-linux-gnu/mesa/libGL.so.1 | grep _glapi_tls_Dispatch – daemondave Dec 21 '18 at 20:19