
I am trying to use pyarrow on Windows, but calling fs.HadoopFileSystem() raises the following error:

OSError                                   Traceback (most recent call last)
Cell In[1], line 2
      1 from pyarrow import fs
----> 2 hdfs = fs.HadoopFileSystem(host='localhost', port=9870)

File c:\prj\study\.venv\lib\site-packages\pyarrow\_hdfs.pyx:96, in pyarrow._hdfs.HadoopFileSystem.__init__()

File c:\prj\study\.venv\lib\site-packages\pyarrow\error.pxi:144, in pyarrow.lib.pyarrow_internal_check_status()

File c:\prj\study\.venv\lib\site-packages\pyarrow\error.pxi:115, in pyarrow.lib.check_status()

OSError: Unable to load libhdfs: 指定されたモジュールが見つかりません。 (English: "The specified module could not be found.")

I followed the steps on this site to install Hadoop using the binaries from Apache, and I am able to use it through cmd. However, when I checked libhdfs.so in lib/native, it shows as a 0 KB file. Is this normal, or do I have to compile the Hadoop source myself to get a working libhdfs.so?
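For reference, here is a minimal sketch of how the file check above could be scripted. It rests on assumptions that are not confirmed in this post: that pyarrow searches the directory named by `ARROW_LIBHDFS_DIR` and `lib/native` under `HADOOP_HOME`, that on Windows it expects a file named `hdfs.dll` rather than `libhdfs.so`, and that Windows builds of Hadoop may ship that DLL under `bin`. The helper name `find_libhdfs` is made up for illustration.

```python
import os
from pathlib import Path

# Assumption: library file names pyarrow may look for, depending on platform.
LIB_NAMES = ("hdfs.dll", "libhdfs.so", "libhdfs.dylib")


def find_libhdfs(env=None):
    """Return a list of (path, size_in_bytes) for each libhdfs candidate found."""
    env = os.environ if env is None else env
    dirs = []
    if env.get("ARROW_LIBHDFS_DIR"):
        dirs.append(Path(env["ARROW_LIBHDFS_DIR"]))
    if env.get("HADOOP_HOME"):
        home = Path(env["HADOOP_HOME"])
        # Assumption: lib/native is the usual location; bin is a guess for Windows builds.
        dirs += [home / "lib" / "native", home / "bin"]
    found = []
    for directory in dirs:
        for name in LIB_NAMES:
            candidate = directory / name
            if candidate.is_file():
                found.append((candidate, candidate.stat().st_size))
    return found


if __name__ == "__main__":
    hits = find_libhdfs()
    if not hits:
        print("No libhdfs candidate found; check HADOOP_HOME / ARROW_LIBHDFS_DIR")
    for path, size in hits:
        status = "EMPTY, cannot be loaded" if size == 0 else "non-empty"
        print(f"{path}: {size} bytes ({status})")
```

A 0-byte hit would be consistent with the "Unable to load libhdfs" error, although a non-empty file is not proof that the library is loadable on this platform.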

  • Can you please translate the error? And no, a 0 KB file isn't normal. Please use the official Apache documentation to install Hadoop. Also, what benefit does pyarrow give you over pyspark? – OneCricketeer Dec 20 '22 at 14:29
  • @OneCricketeer It says that the specified module cannot be found, so most likely it is because libhdfs.so is empty. The official docs say to build it from source, but I can't do that on my company PC because some security settings conflict with bash. Regarding pyarrow vs. pyspark: I am trying to set up MLflow right now, and its method of connecting to HDFS is via pyarrow. – shane.singwa Dec 21 '22 at 02:27
  • I haven't compiled libhdfs on Windows before. Also, the Hadoop documentation says the native libraries are only for *nix platforms. – OneCricketeer Dec 22 '22 at 18:36
