It is not clear what you mean by a "remote" machine. If you mean a machine directly connected to (i.e. part of) the cluster, my other answer holds; if you mean a machine that is not part of the cluster, the answer, as @jedijs has suggested, is to use pywebhdfs (simply install it with pip install pywebhdfs):
from pywebhdfs.webhdfs import PyWebHdfsClient
from pprint import pprint
hdfs = PyWebHdfsClient(host='192.10.10.73', port='50070', user_name='ctsats')  # your Namenode IP & username here
my_dir = 'user/ctsats'
pprint(hdfs.list_dir(my_dir))
The result is a (rather long) Python dictionary (not shown here); experiment a little to get a feel for its structure. You can parse it to extract the names and types (file/directory) as follows:
data = hdfs.list_dir(my_dir)
pprint([[x["pathSuffix"], x["type"]] for x in data["FileStatuses"]["FileStatus"]])
# [[u'.Trash', u'DIRECTORY'],
# [u'.sparkStaging', u'DIRECTORY'],
# [u'checkpoint', u'DIRECTORY'],
# [u'datathon', u'DIRECTORY'],
# [u'ms-spark', u'DIRECTORY'],
# [u'projects', u'DIRECTORY'],
# [u'recsys', u'DIRECTORY'],
# [u'sparklyr', u'DIRECTORY'],
# [u'test.data', u'FILE'],
# [u'word2vec', u'DIRECTORY']]
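Each FileStatus entry also carries more metadata, e.g. the file size (length, in bytes) and the modification time (modificationTime, in milliseconds since the epoch); these field names come from the standard WebHDFS FileStatus schema. A minimal sketch, reusing the data dictionary from above, to print sizes and human-readable timestamps:
from datetime import datetime

for x in data["FileStatuses"]["FileStatus"]:
    # WebHDFS reports modification time in milliseconds, hence the /1000.0
    mod_time = datetime.fromtimestamp(x["modificationTime"] / 1000.0)
    print("{:<15} {:<10} {:>8} bytes  {}".format(
        x["pathSuffix"], x["type"], x["length"], mod_time))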
For comparison, here is the actual listing of the same directory:
[ctsats@dev-hd-01 ~]$ hadoop fs -ls
Found 10 items
drwx------ - ctsats supergroup 0 2016-06-08 13:31 .Trash
drwxr-xr-x - ctsats supergroup 0 2016-12-15 20:18 .sparkStaging
drwxr-xr-x - ctsats supergroup 0 2016-06-23 13:23 checkpoint
drwxr-xr-x - ctsats supergroup 0 2016-02-03 15:40 datathon
drwxr-xr-x - ctsats supergroup 0 2016-04-25 10:56 ms-spark
drwxr-xr-x - ctsats supergroup 0 2016-06-30 15:51 projects
drwxr-xr-x - ctsats supergroup 0 2016-04-14 18:55 recsys
drwxr-xr-x - ctsats supergroup 0 2016-11-07 12:46 sparklyr
-rw-r--r-- 3 ctsats supergroup 90 2016-02-03 16:55 test.data
drwxr-xr-x - ctsats supergroup 0 2016-12-15 20:18 word2vec
The WebHDFS service in your Hadoop cluster must be enabled, i.e. your hdfs-site.xml file must include the following entry:
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
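If you are not sure whether WebHDFS is actually reachable from your machine, you can hit the REST endpoint that pywebhdfs uses under the hood directly; the sketch below assumes the default WebHDFS port 50070 and uses the standard LISTSTATUS operation of the WebHDFS REST API (the requests library is an extra dependency here):
import requests

# WebHDFS REST call roughly equivalent to hdfs.list_dir('user/ctsats')
url = 'http://192.10.10.73:50070/webhdfs/v1/user/ctsats'
r = requests.get(url, params={'op': 'LISTSTATUS', 'user.name': 'ctsats'})
print(r.status_code)   # 200 means WebHDFS is enabled and reachable
print(r.json())        # same FileStatuses dictionary as above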