I am using CDH 5. How do I use Python to get the creation dates of all HDFS files under a directory? I'd rather not use subprocess.Popen() and parse the results; that code doesn't look very elegant.
1 Answer
Snakebite is a Python HDFS client. Its Client has an ls() method that returns file info, including modification_time, and there is an example in its documentation here: http://spotify.github.io/snakebite/client.html#client.Client.ls Note that HDFS tracks a file's modification time (and access time) rather than a separate creation time, so modification_time is the closest thing to a creation date for files that are written once and never changed.
You can install it with pip. The PyPI page for snakebite is here: https://pypi.python.org/pypi/snakebite/
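A minimal sketch of the ls() call (the namenode host/port and the directory are assumptions; substitute your cluster's values):

```python
from datetime import datetime
from snakebite.client import Client

# Connect to the namenode (host and port are placeholders for your cluster's).
client = Client('namenode.example.com', 8020)

# ls() yields one dict per entry; 'modification_time' is epoch milliseconds (UTC).
for entry in client.ls(['/user/hadoop']):
    mtime = datetime.utcfromtimestamp(entry['modification_time'] / 1000.0)
    print('%s  %s' % (entry['path'], mtime))
```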

– Joe Young
- What is the time it is returning? 'modification_time': 1424181276458L? How do I convert it to a datetime object? – johnsam Feb 20 '15 at 01:56
- @johnsam, according to the Hadoop Java API, it's "the modification time of file in milliseconds since January 1, 1970 UTC." https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileStatus.html#getModificationTime() To convert that to a Python datetime, see http://stackoverflow.com/questions/21787496/converting-epoch-time-with-milliseconds-to-datetime and the conversion sketch after these comments. – Joe Young Feb 20 '15 at 06:46
- The Client object requires a namenode in the constructor. How do I find out the namenode with Snakebite? Or is it better to run a namenode on all hosts in the cluster? – johnsam Feb 20 '15 at 16:48
- Why does the client need the namenode in the constructor? It shouldn't; the hdfs dfs command doesn't require a namenode. – johnsam Feb 21 '15 at 17:12
- HDFS is not completely distributed: it requires a centralized namenode that keeps all the filesystem metadata. Datanodes don't know which files they store, only which blocks they hold. Snakebite needs a namenode (or several, in an HA setup) to query for file information. – xinit Feb 21 '15 at 23:01
- And the hdfs command doesn't need the namenode because it reads it from hdfs-site.xml in your Hadoop configuration directory; see the AutoConfigClient sketch below. – xinit Feb 23 '15 at 01:20
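To illustrate the conversion johnsam asked about, here is a sketch using the timestamp from his comment (the printed value is computed by hand, not taken from a live cluster):

```python
from datetime import datetime

# HDFS modification_time is milliseconds since the Unix epoch (UTC).
ms = 1424181276458
dt = datetime.utcfromtimestamp(ms / 1000.0)
print(dt)  # 2015-02-17 13:54:36.458000 (UTC)
```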
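On xinit's last point: Snakebite offers the same convenience through AutoConfigClient, which reads the namenode from the Hadoop configuration directory instead of taking it in the constructor. A minimal sketch, assuming the CDH default config path /etc/hadoop/conf:

```python
import os
from snakebite.client import AutoConfigClient

# AutoConfigClient finds the namenode via the Hadoop config files pointed to
# by HADOOP_CONF_DIR (or HADOOP_HOME), much like the hdfs CLI does.
os.environ.setdefault('HADOOP_CONF_DIR', '/etc/hadoop/conf')  # assumption: CDH default
client = AutoConfigClient()

for entry in client.ls(['/']):
    print('%s  %s' % (entry['path'], entry['modification_time']))
```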