Can't access directory from HDFS inside a Python script

Question

I have the following Python script (it runs fine against a local directory):

#!/usr/bin/env python3

import folderstats

df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)

df.to_csv(r'hdfs://quickstart.cloudera:8020/user/cloudera/files.csv', sep=',', index=True)

The directory "files" exists in that location. I checked from the command line and in HUE, and it is there.

(myproject) [cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera
Found 1 items
drwxrwxrwx   - cloudera cloudera          0 2019-06-01 13:30 /user/cloudera/files

The problem is that the directory can't be accessed.

I tried running it in my local terminal with python3 script.py, and also as the hdfs superuser with sudo -u hdfs python3 script.py. In both cases the output is:

Traceback (most recent call last):
  File "script.py", line 5, in <module>
    df = folderstats.folderstats('hdfs://quickstart.cloudera:8020/user/cloudera/files', hash_name='md5', ignore_hidden=True)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 88, in folderstats
    verbose=verbose)
  File "/home/cloudera/miniconda3/envs/myproject/lib/python3.7/site-packages/folderstats/__init__.py", line 32, in _recursive_folderstats
    for f in os.listdir(folderpath):
FileNotFoundError: [Errno 2] No such file or directory: 'hdfs://quickstart.cloudera:8020/user/cloudera/files'

Can you please help me clarify this issue?

Thank you!

If you want to process CSV on HDFS using Python, PySpark or maybe PyArrow (with pandas) would be the only options I know of – OneCricketeer Jun 04 '19 at 03:21
Thank you for your answer. I am basically trying to "scan" all the files in a directory located in HDFS from my Python script. I don't know how to access this directory. – TheRichUncle Jun 04 '19 at 07:04
HDFS is not your traditional filesystem. All requests must be funneled through the namenode. WebHDFS has a LISTSTATUS operation that may be useful, but you'd have to somehow know which returned entries are directories vs files – OneCricketeer Jun 04 '19 at 07:09
I already tried pywebhdfs (https://pypi.org/project/pywebhdfs/) but without success. It works if I want to read or write, for example, a .txt file in HDFS. Java looks like the fastest option right now. – TheRichUncle Jun 04 '19 at 07:11
WebHDFS needs to be enabled first. Then, did you try the list_dir function? https://pythonhosted.org/pywebhdfs/#pywebhdfs.webhdfs.PyWebHdfsClient.list_dir – OneCricketeer Jun 04 '19 at 08:07
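
The WebHDFS listing idea from the comments could look roughly like the sketch below. It is not code from the thread. It assumes WebHDFS is enabled, that the NameNode HTTP endpoint is quickstart.cloudera:50070 (the usual Hadoop 2 default; your port may differ), and that user.name=cloudera is accepted. The names WEBHDFS_BASE, list_status and walk are made up for illustration, and the fetch parameter exists only so the logic can be exercised without a cluster. Each entry returned by LISTSTATUS carries a "type" field (FILE or DIRECTORY), which is what the recursion relies on to tell files from directories.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: WebHDFS is enabled and reachable on the NameNode HTTP port.
WEBHDFS_BASE = "http://quickstart.cloudera:50070/webhdfs/v1"


def list_status(path, fetch=None):
    """Return the FileStatus entries of one HDFS directory via op=LISTSTATUS."""
    # Assumption: the cluster accepts user.name=cloudera (simple authentication).
    url = f"{WEBHDFS_BASE}{quote(path)}?op=LISTSTATUS&user.name=cloudera"
    if fetch is None:
        # Real HTTP request to the NameNode.
        with urlopen(url) as resp:
            body = resp.read()
    else:
        # Hypothetical hook: lets a caller supply canned responses for testing.
        body = fetch(url)
    return json.loads(body)["FileStatuses"]["FileStatus"]


def walk(path, fetch=None):
    """Yield (hdfs_path, size_in_bytes) for every file under path, recursively."""
    for entry in list_status(path, fetch):
        child = path.rstrip("/") + "/" + entry["pathSuffix"]
        if entry["type"] == "DIRECTORY":
            # Descend into subdirectories.
            yield from walk(child, fetch)
        else:
            yield child, entry["length"]
```

Against a live cluster, something like list(walk("/user/cloudera/files")) would issue one HTTP request per directory. Only metadata travels over the wire, so nothing is copied to the local disk, but file hashes such as the md5 from the question are not part of this listing.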

score 2 · Answer 1 · answered Jun 04 '19 at 02:53


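Python runs on a single machine with a local Linux (or Windows) filesystem (FS).

Hadoop's HDFS is a distributed file system set up across many machines (nodes). A call such as os.listdir only looks at the local filesystem, so it treats the hdfs:// URI as a local path that does not exist.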
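
As a rough illustration of this point (not part of the original answer), the sketch below takes the URI from the traceback in the question and shows the two views of it: as a URI it has a scheme, host, port and path, but the os module treats the whole string as a local path. Only the standard library is used, and the variable names are made up for illustration.

```python
import os
from urllib.parse import urlsplit

# URI copied from the traceback in the question.
uri = "hdfs://quickstart.cloudera:8020/user/cloudera/files"

# Viewed as a URI, the string names a remote filesystem and a path inside it.
parts = urlsplit(uri)
print(parts.scheme)    # remote filesystem scheme
print(parts.hostname)  # NameNode host
print(parts.port)      # NameNode port
print(parts.path)      # path inside HDFS

# The os module knows nothing about schemes: it looks for a local path
# with this exact name, which is the lookup os.listdir fails on.
exists_locally = os.path.exists(uri)
print(exists_locally)
```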

There may be some custom class out there to read HDFS data on a single machine; however, I am not aware of any, and it defeats the purpose of distributed computing.

You could either copy your data from HDFS to the local filesystem where Python lives, via hadoop fs -get hdfs://quickstart.cloudera:8020/user/cloudera/files /home/user/<target_directory_name> (source HDFS location => target local FS location), or use something like Spark, Hive, or Impala to process/query the data.

If the data volume is quite small, copying the files from HDFS to the local FS and then running the Python script should be efficient enough for something like the Cloudera Quickstart VM.
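
For the copy-then-scan route, a minimal sketch might look like the following. It assumes the hadoop CLI is on the PATH and that folderstats is installed, as in the question; build_get_command and copy_and_scan are hypothetical helper names, not part of any library.

```python
import subprocess
import tempfile


def build_get_command(hdfs_uri, local_dir):
    """Build the 'hadoop fs -get <source> <target>' command as an argument list."""
    return ["hadoop", "fs", "-get", hdfs_uri, local_dir]


def copy_and_scan(hdfs_uri):
    """Copy an HDFS directory to a temporary local directory and scan it there."""
    # Imported here so the helper above can be used without folderstats installed.
    import folderstats

    with tempfile.TemporaryDirectory() as tmp:
        # Assumption: the hadoop CLI is installed and can reach the cluster.
        subprocess.run(build_get_command(hdfs_uri, tmp), check=True)
        # Same folderstats call as in the question, now on a local path.
        return folderstats.folderstats(tmp, hash_name="md5", ignore_hidden=True)
```

Calling copy_and_scan('hdfs://quickstart.cloudera:8020/user/cloudera/files') should return the same kind of DataFrame as in the question. Its to_csv output can be written to a local file and, if needed, pushed back to HDFS with hadoop fs -put.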


thePurplePython


Thank you for your answer! Unfortunately, I expect to have a large volume of files in that HDFS directory, so copying the content from HDFS to local and applying the script won't be efficient. I think I will give it a try using Java. It seems to offer better support for this kind of task. – TheRichUncle Jun 04 '19 at 07:22
@TheRichUncle - see if the link below solves your problem. https://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs – vikrant rana Jun 04 '19 at 11:04
How much data do you have? Is there a reason you can't use one of the Hadoop services like Spark, Hive, Impala, or Kafka? – thePurplePython Jun 05 '19 at 18:51
