I want to use pyarrow to read and write Parquet files in HDFS, but I am facing a connectivity issue.
I installed pyarrow and pandas, and I am now trying to connect to HDFS on a remote machine.
Reference link - https://towardsdatascience.com/a-gentle-introduction-to-apache-arrow-with-apache-spark-and-pandas-bb19ffe0ddae
import pyarrow as pa

# NameNode host and RPC port of the remote HDFS cluster
host = '172.17.0.2'
port = 8020
fs = pa.hdfs.connect(host, port)
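Once the connection works, this is roughly how I intend to read and write Parquet through the filesystem handle (a sketch only; the DataFrame and the HDFS path are placeholders):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder data and HDFS path, for illustration only.
df = pd.DataFrame({'a': [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Write the table to Parquet on HDFS via the connected filesystem...
with fs.open('/tmp/example.parquet', 'wb') as f:
    pq.write_table(table, f)

# ...and read it back into pandas.
with fs.open('/tmp/example.parquet', 'rb') as f:
    df_read = pq.read_table(f).to_pandas()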
However, the connect call itself fails with the following error:
>>> fs = pa.hdfs.connect(host, port)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pyarrow/hdfs.py", line 211, in connect
    extra_conf=extra_conf)
  File "/usr/lib64/python2.7/site-packages/pyarrow/hdfs.py", line 36, in __init__
    _maybe_set_hadoop_classpath()
  File "/usr/lib64/python2.7/site-packages/pyarrow/hdfs.py", line 136, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob('hadoop')
  File "/usr/lib64/python2.7/site-packages/pyarrow/hdfs.py", line 161, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "/usr/lib64/python2.7/subprocess.py", line 568, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
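Reading the traceback, pyarrow appears to shell out to the hadoop binary (via _hadoop_classpath_glob) to build the JVM classpath, and the OSError suggests that binary is not on my PATH. Is something like the setup below what pyarrow expects before connecting? (A sketch; every path here is a guess, not my actual layout.)

import os
import pyarrow as pa

# All paths below are placeholder assumptions; substitute the real
# Hadoop and Java locations on the client machine.
os.environ['HADOOP_HOME'] = '/opt/hadoop'
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java'
# pyarrow runs `hadoop classpath --glob` when CLASSPATH is unset,
# so the hadoop binary must be resolvable on PATH.
os.environ['PATH'] = os.environ['HADOOP_HOME'] + '/bin:' + os.environ['PATH']
# libhdfs.so must also be findable at connect time.
os.environ['ARROW_LIBHDFS_DIR'] = os.environ['HADOOP_HOME'] + '/lib/native'

fs = pa.hdfs.connect('172.17.0.2', 8020)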