I have about 1500 XML files in HDFS, each about 2-3 GB. I need to write a Python script that parses these XML files and runs MapReduce on them. However, I am having trouble accessing the files in HDFS from Python.
I tried the following script and received an error:
from snakebite.client import Client

def connection():
    hadoop_client = Client('HDFS_hostname', 'HDFS_port', use_trash=False)
    for x in hadoop_client.ls(['/']):
        print(x)
This is the error:
Traceback (most recent call last):
  File "/home/ubuntu/PycharmProjects/textmining/read_data_from_HDFS.py", line 5, in <module>
    from snakebite.client import Client
  File "/usr/local/lib/python3.6/dist-packages/snakebite/client.py", line 1473
    baseTime = min(time * (1L << retries), cap);
                            ^
SyntaxError: invalid syntax
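From what I can tell, the `1L` in the failing line is a Python 2 long-integer literal, which Python 3 no longer accepts, so the installed snakebite seems to be written for Python 2 while I am running Python 3.6. Here is a minimal sketch that reproduces the same SyntaxError without snakebite or HDFS. The helper name `compiles_on_this_python` is hypothetical and only for illustration; the two strings mirror the expression from the traceback, with and without the `L` suffix.

```python
import sys


def compiles_on_this_python(source):
    """Return True if `source` parses as an expression under the running interpreter."""
    try:
        # compile() only parses; the names (time, retries, cap) are never looked up.
        compile(source, "<snippet>", "eval")
        return True
    except SyntaxError:
        return False


# Same expression shape as the line reported in snakebite/client.py.
py2_style = "min(time * (1L << retries), cap)"  # Python 2 long literal
py3_style = "min(time * (1 << retries), cap)"   # plain int literal

print("Python version:", sys.version_info[:2])
print("1L form parses:", compiles_on_this_python(py2_style))
print("1 form parses: ", compiles_on_this_python(py3_style))
```

On a Python 3 interpreter the `1L` form should fail to parse while the plain `1` form parses, which would suggest the problem is the package's Python version support rather than my connection settings.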
What is the recommended way to access files in HDFS using Python?