I am using CDH 5. How do I use Python to get the creation dates of all HDFS files under a directory? I'd rather not use subprocess.Popen() and parse the output; that code doesn't look very elegant.

johnsam

1 Answer

Snakebite is a Python HDFS client. Its Client class has an ls() method that returns file info including modification_time (note that HDFS tracks modification and access times, but no separate creation time), and there's an example in its documentation here: http://spotify.github.io/snakebite/client.html#client.Client.ls

You can install it with pip. The Python Package Index entry for snakebite is here: https://pypi.python.org/pypi/snakebite/
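For example, here is a minimal sketch of listing a directory and converting each modification_time (epoch milliseconds) to a datetime. The namenode host/port and the /user/hdfs path are placeholders for your own cluster:

    # Minimal sketch: 'localhost', 8020, and '/user/hdfs' are placeholders
    # for your namenode host, its RPC port, and the directory to list.
    from datetime import datetime
    from snakebite.client import Client

    client = Client('localhost', 8020)

    # ls() yields one dict per entry; 'modification_time' is milliseconds
    # since the epoch, so divide by 1000.0 before converting.
    for entry in client.ls(['/user/hdfs']):
        mtime = datetime.fromtimestamp(entry['modification_time'] / 1000.0)
        print('%s\t%s' % (entry['path'], mtime))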

Joe Young
  • What is the time it is returning? 'modification_time': 1424181276458L? How do I convert it to a datetime object? – johnsam Feb 20 '15 at 01:56
  • @johnsam, according to the Hadoop Java API, it's "the modification time of file in milliseconds since January 1, 1970 UTC." https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileStatus.html#getModificationTime() To convert that to a Python datetime, see here: http://stackoverflow.com/questions/21787496/converting-epoch-time-with-milliseconds-to-datetime – Joe Young Feb 20 '15 at 06:46
  • The Client object requires a namenode in the constructor. How do I find the namenode in Snakebite? Or is it better to run a namenode on every host in the cluster? – johnsam Feb 20 '15 at 16:48
  • Why does the client need a namenode in the constructor? It shouldn't. The hdfs dfs command doesn't require a namenode. – johnsam Feb 21 '15 at 17:12
  • HDFS is not completely decentralized; it requires a central namenode where all of the filesystem metadata is kept. Datanodes don't know which files they store, only which blocks they hold. Snakebite needs a namenode (or several, in an HA setup) to query for the file information. – xinit Feb 21 '15 at 23:01
  • And the hdfs command doesn't need the namenode because it reads it from hdfs-site.xml in your Hadoop configuration directory (see the sketch below for doing the same from Python). – xinit Feb 23 '15 at 01:20
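A follow-up on finding the namenode: if memory serves, Snakebite can do that config lookup itself via an AutoConfigClient, which reads the namenode from the Hadoop config files pointed to by HADOOP_CONF_DIR or HADOOP_HOME rather than taking a host in the constructor. A minimal sketch (the /user/hdfs path is a placeholder):

    # Sketch only: AutoConfigClient reads the namenode address from the
    # Hadoop configuration (HADOOP_CONF_DIR / HADOOP_HOME), so nothing
    # is hard-coded here except the placeholder directory to list.
    from snakebite.client import AutoConfigClient

    client = AutoConfigClient()
    for entry in client.ls(['/user/hdfs']):
        print('%s\t%d' % (entry['path'], entry['modification_time']))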