I would like to capture certain information about each file stored in HDFS, such as its name, creation date, modification time, and last access time. I considered using Python's os module, but I'm not sure whether that is possible or how to go about it. Another option would be to use the hdfs module, but information about it online is scarce, which made things even harder.

Does anyone have an idea how I could do this?

Eduardo

1 Answer

HDFS is not a regular filesystem that your operating system can access directly. Therefore, the os module will not be able to do anything with files stored in HDFS.

You could try snakebite, which is a pure Python client for HDFS. There is an example of how to list files in HDFS using snakebite here.
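Whichever client you pick, it helps to know what metadata HDFS actually exposes. Below is a minimal sketch that uses only the standard library and the WebHDFS REST API rather than snakebite. It assumes WebHDFS is enabled on your cluster and reachable without authentication; the function names (`ms_to_datetime`, `summarize`, `list_status`) are made up for this illustration. A WebHDFS `FileStatus` entry carries `modificationTime` and `accessTime` in milliseconds since the epoch, but no creation time, so that field is not available this way.

```python
import json
from datetime import datetime, timezone
from urllib.parse import quote
from urllib.request import urlopen


def ms_to_datetime(ms):
    """Convert HDFS epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def summarize(status, parent="/"):
    """Pick the name and timestamps out of one WebHDFS FileStatus dict."""
    return {
        "name": status["pathSuffix"],
        "path": parent.rstrip("/") + "/" + status["pathSuffix"],
        "type": status["type"],  # "FILE" or "DIRECTORY"
        "modified": ms_to_datetime(status["modificationTime"]),
        "accessed": ms_to_datetime(status["accessTime"]),
    }


def list_status(base_url, path):
    """Fetch the entries of one HDFS directory via the LISTSTATUS operation.

    base_url is the NameNode HTTP address (an assumption about your setup),
    for example "http://namenode.example.com:9870".
    """
    url = f"{base_url}/webhdfs/v1{quote(path)}?op=LISTSTATUS"
    with urlopen(url) as response:
        payload = json.load(response)
    return [summarize(s, path) for s in payload["FileStatuses"]["FileStatus"]]
```

Calling `list_status("http://namenode.example.com:9870", "/data")` (hypothetical host and path) would return one dict per entry in `/data`. Port 9870 is the default NameNode HTTP port in Hadoop 3; Hadoop 2 used 50070. Note that directories report an access time of 0, and access times on files are only updated at the granularity configured by `dfs.namenode.accesstime.precision`.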

  • You're correct that a separate library is needed to interact with HDFS, but I'd be wary of using Snakebite: it hasn't had a commit in 5 years (https://github.com/spotify/snakebite), so I wouldn't expect it to work with recent versions of Hadoop. I'd recommend PyArrow (https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs), which can access HDFS in two lines of code and is actively supported. – Ben Watson Oct 26 '21 at 07:20
  • Relevant discussion about different libraries: https://stackoverflow.com/questions/40285184/whats-the-best-module-for-interacting-with-hdfs-with-python3 – Ben Watson Oct 26 '21 at 07:27
  • Odd. Though it's not like me, I didn't check the latest commit at all :\ –  Oct 26 '21 at 13:23