Can Tensorflow read from HDFS on Mac?

Question

I'm trying to coerce Tensorflow on OS/X to read from HDFS. The documentation

https://www.tensorflow.org/deploy/hadoop

does not clearly specify whether this is possible, and the code refers only to "posix" operating systems. The error I'm seeing when trying to use the HDFS is the following:

UnimplementedError (see above for traceback): File system scheme hdfs not implemented [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]

Here's what I've done up to this point:

brew installed Hadoop 2.7.2
separately compiled Hadoop 2.7.2 for the native libraries. Hadoop is installed on /usr/local/Cellar/hadoop/2.7.2/libexec on my system, and the native libraries (libhdfs.dylib) are in ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2/lib/native.
Edited the code at https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L113-L119 to read from libhdfs.dylib rather than libhdfs.so, recompiled, and reinstalled Tensorflow. (I have to admit this is pretty boneheaded, and I have no idea if it's all that's required to make this code work on Mac.)

Here is the code to reproduce.

test.sh:

set -x

export JAVA_HOME=$($(dirname $(which java | xargs readlink))/java_home)
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.2/libexec

. $HADOOP_HOME/libexec/hadoop-config.sh

export HADOOP_HDFS_HOME=$(echo ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2)

export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hdfs classpath --glob)

# Virtual environment with Tensorflow and necessary dependencies
. venv/bin/activate

python ./test.py

test.py:

import tensorflow as tf

_, example_bytes = tf.TFRecordReader().read(
    tf.train.string_input_producer(
        [
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00000",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00001",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00002",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00003",
        ]
    )
)

with tf.Session().as_default() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(len(sess.run(example_bytes)))

The code path I'm seeing in the Tensorflow source seems to indicate to me that I'd receive a different error than the one above if the issue were really mac-specific, since some kind of handler is registered for the "hdfs" scheme regardless: https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L474 . Has anyone else succeeded in coercing Tensorflow to work with Mac? If it isn't supported, is there an easy place to patch it?

I'm also open to suggestions as to what might be a better approach. The high-level goal is to efficiently train a model in parallel, using shared parameter servers, considering that each worker will only read a subset of the data. This is readily accomplished using the local filesystem, but it's less clear how to scale beyond that. Even if I do succeed in making the code above work, the result could suffer from problems with data locality.

This thread https://github.com/tensorflow/tensorflow/issues/2218 suggests using pyspark.RDD.toLocalIterator to iterate over the data set with a placeholder in the graph. Aside from my concern about forcing each worker to iterate through the full dataset, I don't see a way to coerce Tensorflow's builtin Estimator class to accept a custom feed function along with a specified input_fn, and a custom input_fn appears necessary in order to take advantage of models like LinearClassifier (https://www.tensorflow.org/tutorials/linear) that are capable of learning from sparse, weighted features.

Any thoughts?

score 1 · Accepted Answer · answered Mar 06 '17 at 18:45

1

Did you enable HDFS support in ./configure when building? That's the error you would get if HDFS is disabled.

I think you made the correct change to make it work. Feel free to send a pull request to look for .dylib on macOS.

answered Mar 06 '17 at 18:45

Jonathan Hseu

61
1

Thank you! I went through that prompt about 10 times and totally forgot about it. I fought with Bazel a bit but couldn't immediately coerce it into giving me a define for Darwin. If I get that working, I'll submit a patch. Changing to .dyllib was indeed all that was needed for me. – Josh Mar 08 '17 at 17:06

Can Tensorflow read from HDFS on Mac?

1 Answers1

Linked