
I've tried installing pyarrow via pip (pip install pyarrow and, as Yagav suggested, py -3.7 -m pip install --user pyarrow) and via conda (conda install -c conda-forge pyarrow, also conda install pyarrow), and I've tried building the lib from source (using a conda environment and some magic I don't really understand). Every time the installation completes without errors, but it always ends with one and the same problem when I call:

import pyarrow as pa
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host', kerb_ticket='path_to_kerb_ticket')

it fails with the following message:

Traceback (most recent call last):
  File "", line 1, in 
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 209, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    _maybe_set_hadoop_classpath()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 135, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob(hadoop_bin)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 162, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
OSError: [WinError 193] %1 is not a valid win32 application

At first I thought there was a problem with libhdfs.so from Hadoop 2.5.6, but it seems I was wrong about that. I guess the problem is not in pyarrow or subprocess, but in some system variables or dependencies.

I have also manually defined the system variables HADOOP_HOME, JAVA_HOME and KRB5CCNAME. For reference, they can be checked from inside Python like this:
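import os

# Show the environment variables that Hadoop/Kerberos discovery relies on.
for name in ('HADOOP_HOME', 'JAVA_HOME', 'KRB5CCNAME'):
    print(name, '=', os.environ.get(name))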

  • _I've been trying pyarrow installation via pip and conda, building lib from src_ Can you expand on that? How exactly did you install it? Did it succeed? – AMC Mar 12 '20 at 20:25
  • As I expanded in the text, I've used the following methods: pip install pyarrow, py -3.7 -m pip install --user pyarrow, conda install pyarrow, conda install -c conda-forge pyarrow; I also built pyarrow from source and dropped it into the site-packages folder of my conda Python – Eliot Leshchenko Mar 13 '20 at 04:10

3 Answers


OK, I found it myself. As I suspected, the problem was in the system environment variables: there needs to be a CLASSPATH variable containing the paths to all the .jar files of the Hadoop client. You can get those paths by running hadoop classpath or hadoop classpath --glob in cmd (a Python sketch of the same fix follows below).
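For example, a minimal sketch of doing the same from Python before connecting (assuming hadoop.cmd is on PATH on Windows; the connection arguments are the placeholders from the question):

import os
import subprocess

import pyarrow as pa

# Build CLASSPATH from the Hadoop client's own classpath listing.
# On Windows call hadoop.cmd explicitly; the extensionless 'hadoop'
# shell script is not a valid Win32 executable (hence WinError 193).
classpath = subprocess.check_output(['hadoop.cmd', 'classpath', '--glob'])
os.environ['CLASSPATH'] = classpath.decode('utf-8')

fs = pa.hdfs.connect(host='my_host', user='my_user@my_host',
                     kerb_ticket='path_to_kerb_ticket')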


This is the solution: replace _maybe_set_hadoop_classpath in pyarrow/hdfs.py with a version that calls hadoop.cmd on Windows:

def _maybe_set_hadoop_classpath():
    import os
    import subprocess

    # Nothing to do if the Hadoop jars are already on the classpath.
    if 'hadoop' in os.environ.get('CLASSPATH', ''):
        return

    # On Windows the launcher is hadoop.cmd; invoking the extensionless
    # 'hadoop' shell script is exactly what raises WinError 193.
    if 'HADOOP_HOME' in os.environ:
        hadoop_home = os.path.normpath(os.environ['HADOOP_HOME'])
        hadoop_bin_exe = os.path.join(hadoop_home, 'bin', 'hadoop.cmd')
    else:
        hadoop_bin_exe = 'hadoop.cmd'  # fall back to a PATH lookup

    classpath = subprocess.check_output([hadoop_bin_exe, 'classpath', '--glob'])
    os.environ['CLASSPATH'] = classpath.decode('utf-8')
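
If you would rather not edit hdfs.py inside site-packages, one alternative (a sketch, relying on the early return above once CLASSPATH is populated) is to call the patched helper yourself before connecting:

import pyarrow as pa

# Populate CLASSPATH via the patched helper defined above, then connect.
# Host, user and ticket values are the placeholders from the question.
_maybe_set_hadoop_classpath()
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host',
                     kerb_ticket='path_to_kerb_ticket')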
quantCode

You could run this command in cmd to install pyarrow correctly:

py -3.7 -m pip install --user pyarrow

After installing, try the code again; a quick sanity check is sketched below.
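For example, a minimal check that the package imports cleanly:

import pyarrow as pa

# Importing and printing the version should succeed without errors.
print(pa.__version__)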

Yagav