I want to import pyarrow in a Python shell Glue script because I need to export a DataFrame as Parquet (i.e. with DataFrame.to_parquet()).
The way to add custom dependencies suggested in the AWS docs is to use .egg or .whl files (https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library).
The library pyarrow has numpy and six as dependencies:
- numpy is already pre-installed on Glue, with version 1.16.2, as I checked with a simple print(numpy.version.version).
- six is not pre-installed, so I downloaded six-1.14.0-py2.py3-none-any.whl from PyPI and uploaded it to S3.
- pyarrow is not pre-installed, so I downloaded the wheel file pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl from PyPI and uploaded it to S3.
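To double-check which dependencies a wheel declares before uploading it, the metadata inside the wheel can be read directly. This is a minimal sketch, assuming the standard wheel layout (a `*.dist-info/METADATA` file containing `Requires-Dist` lines); the helper name `wheel_requirements` is made up for illustration.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as whl:
        # A wheel is a zip archive; its metadata lives in
        # <name>-<version>.dist-info/METADATA.
        metadata_names = [
            n for n in whl.namelist() if n.endswith(".dist-info/METADATA")
        ]
        if not metadata_names:
            return []
        raw = whl.read(metadata_names[0]).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it.
    headers = Parser().parsestr(raw)
    return headers.get_all("Requires-Dist") or []
```

Calling `wheel_requirements("pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl")` on the downloaded file should list the numpy and six requirements that pip reports in the errors further down.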
The script itself is this:
import pandas as pd
import six
import numpy
from pyarrow import *
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
df.to_parquet('test.parquet')
When I run the script adding as libraries the wheel files of six and pyarrow, I get the following message:
Processing ./glue-python-libs-f8nyy9el/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl
and the following error:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96e10>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96c88>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96dd8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c969b0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96898>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
ERROR: Could not find a version that satisfies the requirement numpy>=1.14 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for numpy>=1.14 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.
So at first it seems that six is installed correctly, but then it looks like the job does not realize that numpy is already present with a compatible version.
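The traceback suggests that each uploaded file is installed by its own `pip install --target=...` call. Below is a sketch of that call, reconstructed from the traceback. The function name `build_pip_command` and the `no_deps` option are illustrative assumptions, not part of Glue. `--no-deps` is a real pip flag that skips dependency resolution, but nothing in the logs confirms that a Glue job lets you pass it.

```python
import sys


def build_pip_command(wheel_path, install_path, no_deps=False):
    """Build a per-wheel pip command shaped like the one in the traceback."""
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--target={}".format(install_path),
    ]
    if no_deps:
        # Hypothetical variant: ask pip not to resolve numpy/six for pyarrow.
        cmd.append("--no-deps")
    cmd.append(wheel_path)
    return cmd
```

If every wheel really goes through a separate invocation like this, pip resolves the requirements of pyarrow on its own during that call, which would be consistent with the attempts to reach pypi.org in the log.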
Then I also uploaded to S3 the wheel file s3://risultati-navigazione-wt-ga/libs/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl, which I downloaded from PyPI. In this case I get the message:
Processing ./glue-python-libs-xzfdvgzd/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: numpy
Successfully installed numpy-1.18.2
Processing ./glue-python-libs-xzfdvgzd/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl
and the error:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
ERROR: Could not find a version that satisfies the requirement six>=1.0.0 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for six>=1.0.0 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.
So this time numpy is recognized during the installation of pyarrow but, as far as I understand, although six is installed correctly, for some reason pyarrow can't find it during the installation, and indeed pip tries to download it from the Internet (it gets stuck for a few minutes during that operation).
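In the same spirit as the print(numpy.version.version) check, a run without the pyarrow wheel can show which modules the job's interpreter sees and where they are loaded from. This is a minimal sketch using only the standard library; the helper name `module_origin` is made up here.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from.

    Returns None if the module cannot be found (or has no file).
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


for name in ("numpy", "six", "pyarrow"):
    print(name, "->", module_origin(name))
```

Comparing the printed paths with the /glue/lib/installation target from the traceback may help tell the pre-installed packages apart from the ones installed from the uploaded wheels.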
Can anybody help me? Thanks!