
I have a Hadoop cluster running on CentOS 6.5, and I am currently using Python 2.6. For unrelated reasons I can't upgrade to Python 2.7, and because of that I cannot install pydoop. Inside the Hadoop cluster I have a large number of raw data files named raw"yearmonthdaytimehour".txt (everything in quotes is a number). Is there a way to make a list of all the files in a Hadoop directory in Python? The program would create a list that looks something like:

listoffiles = ['raw160317220001.txt', 'raw160317230001.txt', ...]

It would make everything I need to do a lot easier, since to get the file from day 2, hour 15, I would just need to call dothing(listoffiles[39]). There are unrelated complications as to why I have to do it this way.

I know there is a way to do this easily with local directories, but Hadoop makes everything a little more complicated.

Sam
  • So you're asking for a way to list HDFS files in Python without pydoop? – kichik Apr 02 '16 at 22:22
  • Just run the `hadoop fs -ls` command via a shell process (assuming you have the Hadoop binaries installed); a sketch of that approach is shown after these comments. – OneCricketeer Apr 02 '16 at 22:24
  • I'm asking how to create an array containing the names of all the HDFS files. – Sam Apr 02 '16 at 22:25
  • Can you use the hadoopy library? – OneCricketeer Apr 02 '16 at 22:29
  • Yes, I have full access to all Hadoop commands from the command line. Due to the fact that I can't install pydoop, I've just been manually creating all the commands and sending them to the command line, but manually recreating these date-name strings when random hours are skipped is difficult. This method seemed a lot easier. – Sam Apr 02 '16 at 22:36
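
A minimal sketch of the shell-out approach from the comment above, assuming the hadoop binary is on the PATH; /path/to/raw/data and the helper name list_hdfs_dir are placeholders, not anything from the original post. `hadoop fs -ls` prints one entry per line with the full path in the last column, so the basename of that column is the file name.

import subprocess

def list_hdfs_dir(path):
    # Run `hadoop fs -ls <path>` and capture its output;
    # subprocess.check_output does not exist on Python 2.6, so use Popen.
    p = subprocess.Popen(['hadoop', 'fs', '-ls', path],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    names = []
    for line in out.splitlines():
        parts = line.split()
        # Skip the "Found N items" header line; entry lines end with the full path.
        if len(parts) >= 8 and parts[-1].startswith('/'):
            names.append(parts[-1].rsplit('/', 1)[-1])
    return sorted(names)

listoffiles = list_hdfs_dir('/path/to/raw/data')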

3 Answers


If pydoop doesn't work, you can try the Snakebite library, which should work with Python 2.6. Another option is enabling the WebHDFS API and calling it directly with requests or something similar.

import requests
print requests.get("http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS").json()
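
To turn that response into the flat list of names the question asks for, a sketch along these lines should work (host, port, and path stay as placeholders): WebHDFS returns the directory entries under FileStatuses/FileStatus, with the bare file name in each entry's pathSuffix field.

import requests

url = "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
statuses = requests.get(url).json()["FileStatuses"]["FileStatus"]
listoffiles = sorted(s["pathSuffix"] for s in statuses)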

With Snakebite:

from snakebite.client import Client

# host/port are the namenode address, i.e. the value of fs.defaultFS/fs.default.name
# in core-site.xml (8020 is the common default RPC port)
client = Client("localhost", 8020, use_trash=False)
for x in client.ls(['/']):
    print x
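
If only the bare file names are needed, each entry yielded by ls() is a dict whose 'path' field holds the full path, so something like this builds the list from the question (the directory is a placeholder):

# Strip the directory prefix from each entry's full path to get just the file names.
listoffiles = sorted(x['path'].rsplit('/', 1)[-1]
                     for x in client.ls(['/path/to/raw/data']))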
kichik
  • How would I do it using Snakebite? I got that to install. – Sam Apr 02 '16 at 23:01
  • I updated the answer to include the [example](http://snakebite.readthedocs.org/en/latest/client.html?highlight=list). – kichik Apr 02 '16 at 23:04
  • Is localhost defined as what is found in core-site.xml, or just the string "localhost"? I'm getting an error "no module named client" with a lowercase c. – Sam Apr 02 '16 at 23:32

I would suggest checking out hdfs3:

>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')

Like Snakebite, hdfs3 uses protobufs for communication and bypasses the JVM. Unlike Snakebite, hdfs3 offers Kerberos support.

quasiben

I would recommend this Python project: https://github.com/mtth/hdfs. It uses HttpFS and it's actually quite simple and fast. I've been using it on my cluster with Kerberos and it works like a charm. You just need to set the NameNode or HttpFS service URL.
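
A minimal sketch of that setup, assuming a plain (non-Kerberos) WebHDFS or HttpFS endpoint at the placeholder URL below; with Kerberos you would use hdfs.ext.kerberos.KerberosClient instead of InsecureClient. The directory path is also a placeholder.

from hdfs import InsecureClient

# Placeholder URL: point this at the NameNode WebHDFS or HttpFS service.
client = InsecureClient('http://<NAMENODE_OR_HTTPFS_HOST>:<PORT>')
listoffiles = sorted(client.list('/path/to/raw/data'))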

Leonel Atencio