I stumbled upon the same problem and finally found a workaround using the HdfsCli module and tempfiles with Python 3.4.
- imports:

```python
import numpy
from hdfs import InsecureClient
from tempfile import TemporaryFile
```
- create an HDFS client. In most cases, it is better to have a utility function somewhere in your script, like this one:

```python
def get_hdfs_client():
    return InsecureClient("<your webhdfs uri>", user="<hdfs user>",
                          root="<hdfs base path>")
```
- load and save your numpy array inside a worker function:
```python
hdfs_client = get_hdfs_client()

# load from file.npy
path = "/whatever/hdfs/file.npy"
tf = TemporaryFile()
with hdfs_client.read(path) as reader:
    tf.write(reader.read())
tf.seek(0)  # important: set the cursor to the beginning of the file
np_array = numpy.load(tf)

...

# save to file.npy
tf = TemporaryFile()
numpy.save(tf, np_array)
tf.seek(0)  # important: set the cursor to the beginning of the file
# with overwrite=False, an exception is thrown if the file already exists
hdfs_client.write("/whatever/output/file.npy", tf.read(), overwrite=True)
```
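As a side note on that `overwrite` flag: with `overwrite=False` (which is the default, if I remember correctly), the write fails when the target already exists. A minimal sketch of handling that case, assuming you want to catch HdfsCli's generic `HdfsError` from `hdfs.util`:

```python
from hdfs.util import HdfsError  # HdfsCli's generic error class

tf = TemporaryFile()
numpy.save(tf, np_array)
tf.seek(0)
try:
    # overwrite=False: refuse to clobber an existing file
    hdfs_client.write("/whatever/output/file.npy", tf.read(), overwrite=False)
except HdfsError as error:
    # e.g. the target already exists; log it and decide how to proceed
    print("HDFS write failed: %s" % error)
```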
Notes:
- the URI used to create the HDFS client begins with `http://`, because it uses the web interface of the HDFS file system;
- ensure that the user you pass to the HDFS client has read and write permissions;
- in my experience, the overhead is not significant (at least in terms of execution time);
- the advantage of using tempfiles (vs. regular files in `/tmp`) is that you ensure no garbage files are left on the cluster machines after the script ends, whether it terminates normally or not; see the round-trip sketch below.
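To take that last point a step further, you can wrap the tempfile in a `with` block so it is closed (and therefore deleted) even if the worker raises. A small round-trip sketch under the same assumptions as above (the paths and the `get_hdfs_client` helper are the placeholders from the earlier snippets):

```python
import numpy
from tempfile import TemporaryFile

def load_npy(hdfs_client, hdfs_path):
    # download the .npy bytes into an anonymous temp file, then let numpy parse it
    with TemporaryFile() as tf:
        with hdfs_client.read(hdfs_path) as reader:
            tf.write(reader.read())
        tf.seek(0)
        return numpy.load(tf)

def save_npy(hdfs_client, hdfs_path, np_array):
    # serialize with numpy first, then upload the raw bytes
    with TemporaryFile() as tf:
        numpy.save(tf, np_array)
        tf.seek(0)
        hdfs_client.write(hdfs_path, tf.read(), overwrite=True)

hdfs_client = get_hdfs_client()
np_array = load_npy(hdfs_client, "/whatever/hdfs/file.npy")
save_npy(hdfs_client, "/whatever/output/file.npy", np_array)
```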