I'm trying to coerce TensorFlow on OS X to read from HDFS. The documentation at
https://www.tensorflow.org/deploy/hadoop
doesn't clearly state whether this is possible, and the relevant code refers only to "posix" operating systems. The error I see when trying to read from HDFS is the following:
UnimplementedError (see above for traceback): File system scheme hdfs not implemented [[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer)]]
Here's what I've done up to this point:
- Installed Hadoop 2.7.2 with Homebrew; it lives at /usr/local/Cellar/hadoop/2.7.2/libexec on my system.
- Separately compiled Hadoop 2.7.2 to get the native libraries; libhdfs.dylib ends up in ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2/lib/native.
- Edited the code at https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L113-L119 to load libhdfs.dylib rather than libhdfs.so, then recompiled and reinstalled TensorFlow. (I have to admit this is pretty boneheaded, and I have no idea whether it's all that's required to make this code work on a Mac; a quick check that the library at least loads is sketched right after this list.)
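As a quick sanity check on that last step (my own diagnostic, not anything from the TensorFlow docs), I load the dylib directly with ctypes; the path is the one from my build above, and if the library or one of its dependencies (e.g. libjvm) can't be resolved, CDLL raises an OSError with the linker's message:

import ctypes
import os

# Standalone check: dlopen the same library TensorFlow tries to load.
path = os.path.expanduser(
    "~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/"
    "hadoop-hdfs-2.7.2/lib/native/libhdfs.dylib")
print(ctypes.CDLL(path))  # raises OSError if the dylib can't be loaded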
Here is the code to reproduce the problem.
test.sh:
set -x

# Locate the JDK via the java_home helper next to the resolved java binary.
export JAVA_HOME=$($(dirname $(which java | xargs readlink))/java_home)

# Homebrew's Hadoop; source the stock environment setup it ships with.
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.2/libexec
. $HADOOP_HOME/libexec/hadoop-config.sh

# Point TensorFlow at the separately compiled tree containing lib/native/libhdfs.dylib.
export HADOOP_HDFS_HOME=$(echo ~/Source/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-2.7.2)

# libhdfs needs the full Hadoop classpath at runtime.
export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hdfs classpath --glob)

# Virtual environment with TensorFlow and necessary dependencies
. venv/bin/activate
python ./test.py
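Before running test.py I also print the variables the HDFS support depends on (per the deploy/hadoop page), just to confirm the script exported them into the environment Python actually sees:

import os

# Print the env vars TensorFlow's HDFS loader relies on; "<unset>" flags a problem.
for var in ("JAVA_HOME", "HADOOP_HDFS_HOME", "CLASSPATH"):
    print(var, "=", os.environ.get(var, "<unset>"))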
test.py:
import tensorflow as tf

# Queue up the TFRecord files on HDFS and read one serialized record.
_, example_bytes = tf.TFRecordReader().read(
    tf.train.string_input_producer(
        [
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00000",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00001",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00002",
            "hdfs://localhost:9000/user/foo/feature_output/part-r-00003",
        ]
    )
)

with tf.Session().as_default() as sess:
    # string_input_producer needs queue runners before anything can be read.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    print(len(sess.run(example_bytes)))
    coord.request_stop()
    coord.join(threads)
The code path I'm seeing in the TensorFlow source suggests I'd receive a different error than the one above if the issue were really Mac-specific, since a handler is registered for the "hdfs" scheme unconditionally: https://github.com/tensorflow/tensorflow/blob/v1.0.0/tensorflow/core/platform/hadoop/hadoop_file_system.cc#L474 . Has anyone else succeeded in coercing TensorFlow to work with HDFS on a Mac? If it isn't supported, is there an easy place to patch it?
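For what it's worth, here is a minimal probe I've been using to test the registration question directly, without involving the reader ops (as far as I can tell, tf.gfile goes through the same file-system registry; the path is one of my files above):

import tensorflow as tf

try:
    # If the "hdfs" handler is registered in this build, this returns True/False
    # (or fails with a Hadoop/JNI error); otherwise it raises UnimplementedError.
    print(tf.gfile.Exists(
        "hdfs://localhost:9000/user/foo/feature_output/part-r-00000"))
except tf.errors.UnimplementedError as e:
    print("hdfs scheme not registered in this build:", e)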
I'm also open to suggestions for a better approach. The high-level goal is to efficiently train a model in parallel, using shared parameter servers, with each worker reading only a subset of the data; I'd expect to shard the input along the lines of the sketch below. This is readily accomplished using the local filesystem, but it's less clear how to scale beyond that. Even if I do succeed in making the code above work, the result could suffer from problems with data locality.
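Here's roughly what I mean by each worker reading a subset (a sketch only; task_index and num_workers are placeholders that would come from the cluster spec in a real setup):

import tensorflow as tf

# Hypothetical sharding: each worker takes a disjoint stride of the file list.
all_files = [
    "hdfs://localhost:9000/user/foo/feature_output/part-r-%05d" % i
    for i in range(4)
]
task_index, num_workers = 0, 2  # placeholders (assumption)
my_files = all_files[task_index::num_workers]
_, example_bytes = tf.TFRecordReader().read(
    tf.train.string_input_producer(my_files))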
This thread https://github.com/tensorflow/tensorflow/issues/2218 suggests using pyspark.RDD.toLocalIterator to iterate over the data set, feeding a placeholder in the graph. Aside from my concern about forcing each worker to iterate through the full dataset, I don't see a way to make TensorFlow's built-in Estimator class accept a custom feed function along with a specified input_fn, and a custom input_fn (something like the sketch below) appears necessary in order to take advantage of models like LinearClassifier (https://www.tensorflow.org/tutorials/linear) that are capable of learning from sparse, weighted features.
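For concreteness, this is the shape of input_fn I have in mind (a hedged sketch: the feature names "terms", "weights", and "label", the batch size, and the single input file are all made up for illustration); the sparse terms/weights pair would then feed something like tf.contrib.layers.weighted_sparse_column:

import tensorflow as tf

def input_fn():
    # Read a batch of serialized tf.train.Example records from HDFS.
    files = ["hdfs://localhost:9000/user/foo/feature_output/part-r-00000"]
    _, serialized = tf.TFRecordReader().read(
        tf.train.string_input_producer(files))
    batch = tf.train.batch([serialized], batch_size=128)
    # Parse sparse, weighted features plus a label out of each record.
    features = tf.parse_example(batch, {
        "terms": tf.VarLenFeature(tf.string),     # sparse feature ids
        "weights": tf.VarLenFeature(tf.float32),  # per-term weights
        "label": tf.FixedLenFeature([1], tf.int64),
    })
    label = features.pop("label")
    return features, label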
Any thoughts?