
I have a Hadoop cluster running on CentOS 6.5, and I am currently using Python 2.6. For unrelated reasons I can't upgrade to Python 2.7, and because of that I cannot install pydoop. Inside the Hadoop cluster I have a large number of raw data files named raw"yearmonthdaytimehour".txt (everything in quotes is a number). Is there a way to make a list of all the files in a Hadoop directory in Python? The program would create a list that looks something like:

listoffiles = ['raw160317220001.txt', 'raw160317230001.txt', ...]

It would make everything I need to do a lot easier, since to get the file from day 2, hour 15, I would just need to call dothing(listoffiles[39]). There are unrelated complications as to why I have to do it this way.

I know there is a way to do this easily with local directories, but Hadoop makes everything a little more complicated.

Sam
  • So you're asking for a way to list HDFS files in Python without pydoop? – kichik Apr 02 '16 at 22:22
  • Just run the `hadoop fs -ls` command via a shell process (assuming you have the Hadoop binaries installed); a sketch of that approach is shown after these comments. – OneCricketeer Apr 02 '16 at 22:24
  • I'm asking how to create an array containing the names of all the HDFS files. – Sam Apr 02 '16 at 22:25
  • Can you use the hadoopy library? – OneCricketeer Apr 02 '16 at 22:29
  • Yes, I have full access to all Hadoop commands from the command line. Due to the fact that I can't install pydoop, I've just been manually creating all the commands and sending them to the command line, but manually recreating these date-name strings when random hours are skipped is difficult. This method seemed a lot easier. – Sam Apr 02 '16 at 22:36
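
A minimal sketch of the shell-out approach from the comment above, assuming the hadoop binary is on the PATH; /path/to/raw/data and the helper name list_hdfs_dir are placeholders, not anything from the original post. `hadoop fs -ls` prints one entry per line with the full path in the last column, so the basename of that column is the file name.

import subprocess

def list_hdfs_dir(path):
    # Run `hadoop fs -ls <path>` and capture its output;
    # subprocess.check_output does not exist on Python 2.6, so use Popen.
    p = subprocess.Popen(['hadoop', 'fs', '-ls', path],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    names = []
    for line in out.splitlines():
        parts = line.split()
        # Skip the "Found N items" header line; entry lines end with the full path.
        if len(parts) >= 8 and parts[-1].startswith('/'):
            names.append(parts[-1].rsplit('/', 1)[-1])
    return sorted(names)

listoffiles = list_hdfs_dir('/path/to/raw/data')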

3 Answers


If pydoop doesn't work, you can try the Snakebite library, which should work with Python 2.6. Another option is enabling the WebHDFS API and calling it directly with requests or something similar.

import requests
print requests.get("http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS").json()
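
To turn that response into the flat list of names the question asks for, a sketch along these lines should work (host, port, and path stay as placeholders): WebHDFS returns the directory entries under FileStatuses/FileStatus, with the bare file name in each entry's pathSuffix field.

import requests

url = "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
statuses = requests.get(url).json()["FileStatuses"]["FileStatus"]
listoffiles = sorted(s["pathSuffix"] for s in statuses)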

With Snakebite:

from snakebite.client import Client

# host/port are the namenode address, i.e. the value of fs.defaultFS/fs.default.name
# in core-site.xml (8020 is the common default RPC port)
client = Client("localhost", 8020, use_trash=False)
for x in client.ls(['/']):
    print x
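
If only the bare file names are needed, each entry yielded by ls() is a dict whose 'path' field holds the full path, so something like this builds the list from the question (the directory is a placeholder):

# Strip the directory prefix from each entry's full path to get just the file names.
listoffiles = sorted(x['path'].rsplit('/', 1)[-1]
                     for x in client.ls(['/path/to/raw/data']))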
kichik
  • How would I do it using Snakebite? I got that to install. – Sam Apr 02 '16 at 23:01
  • I updated the answer to include the [example](http://snakebite.readthedocs.org/en/latest/client.html?highlight=list). – kichik Apr 02 '16 at 23:04
  • Is localhost defined as what is found in core-site.xml, or just the string "localhost"? I'm getting an error "no module named client" with a lowercase c. – Sam Apr 02 '16 at 23:32

I would suggest checking out hdfs3:

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')

Like Snakebite, hdfs3 uses protobufs for communication and bypasses the JVM. Unlike Snakebite, hdfs3 offers Kerberos support.

quasiben

I would recommend this Python project: https://github.com/mtth/hdfs. It uses HttpFS and it's actually quite simple and fast. I've been using it on my cluster with Kerberos and it works like a charm. You just need to set the NameNode or HttpFS service URL.
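
A minimal sketch of that setup, assuming a plain (non-Kerberos) WebHDFS or HttpFS endpoint at the placeholder URL below; with Kerberos you would use hdfs.ext.kerberos.KerberosClient instead of InsecureClient. The directory path is also a placeholder.

from hdfs import InsecureClient

# Placeholder URL: point this at the NameNode WebHDFS or HttpFS service.
client = InsecureClient('http://<NAMENODE_OR_HTTPFS_HOST>:<PORT>')
listoffiles = sorted(client.list('/path/to/raw/data'))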

Leonel Atencio