
I want to import pyarrow in a Python shell Glue script because I need to export a dataframe as parquet (i.e. with DataFrame.to_parquet()).

The way to add custom dependencies suggested in the AWS docs is to use .egg or .whl files (https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html#create-python-extra-library).

The library pyarrow has numpy and six as dependencies:

  • numpy is already pre-installed on Glue, at version 1.16.2, which I checked with a simple print(numpy.version.version)

  • six is not pre-installed, so I downloaded six-1.14.0-py2.py3-none-any.whl from PyPI and uploaded it to S3.

  • pyarrow is not pre-installed, so I downloaded the wheel file pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl from PyPI and uploaded it to S3.

The script itself is this:

import pandas as pd
import six
import numpy
from pyarrow import *

data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
df.to_parquet('test.parquet')

When I run the script adding as libraries the wheel files of six and pyarrow, I get the following message:

Processing ./glue-python-libs-f8nyy9el/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl

and the following error:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96e10>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96c88>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96dd8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c969b0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d78c96898>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/numpy/
ERROR: Could not find a version that satisfies the requirement numpy>=1.14 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for numpy>=1.14 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-f8nyy9el/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.

So at first, it seems that six is installed correctly, but then it looks like the job does not realize that numpy is already present with a compatible version.
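One way to check outside Glue whether a set of wheels is self-sufficient is to ask pip to resolve everything from a local folder without contacting PyPI. The sketch below only builds such a command. The function name offline_install_command and its arguments are hypothetical, while --no-index, --find-links and --target are standard pip options. It assumes all the wheels were downloaded into one local directory.

```python
import sys


def offline_install_command(wheel_dir, target_dir, requirement):
    """Build a pip command that resolves dependencies only from wheel_dir.

    Hypothetical helper for a local check; it is not part of Glue.
    """
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                 # never contact PyPI
        "--find-links", wheel_dir,    # look for wheels in this folder instead
        "--target", target_dir,       # same kind of install Glue's runscript.py performs
        requirement,
    ]
```

Passing the resulting list to subprocess.check_call on a machine with the same Python version would show whether pip can satisfy every requirement from the downloaded wheels alone.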

Then I also tried uploading to S3 the numpy wheel file that I downloaded from PyPI (s3://risultati-navigazione-wt-ga/libs/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl). In this case I get the message:

Processing ./glue-python-libs-xzfdvgzd/numpy-1.18.2-cp36-cp36m-manylinux1_x86_64.whl
Installing collected packages: numpy
Successfully installed numpy-1.18.2
Processing ./glue-python-libs-xzfdvgzd/six-1.14.0-py2.py3-none-any.whl
Installing collected packages: six
Successfully installed six-1.14.0
Processing ./glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl

and the error:

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fca02861cc0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
ERROR: Could not find a version that satisfies the requirement six>=1.0.0 (from pyarrow==0.16.0) (from versions: none)
ERROR: No matching distribution found for six>=1.0.0 (from pyarrow==0.16.0)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 112, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 62, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-xzfdvgzd/pyarrow-0.16.0-cp36-cp36m-manylinux2014_x86_64.whl']' returned non-zero exit status 1.

So this time numpy is recognized during the installation of pyarrow but, as far as I understand, although six is installed correctly, for some reason pyarrow can't find it during the installation, and pip tries to download it from the Internet instead (it gets stuck for a few minutes during that operation).
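The requirements that pip complains about (numpy>=1.14, six>=1.0.0) are declared inside the wheel itself, in its METADATA file. As a minimal sketch using only the standard library, the following reads those declarations from a local wheel. The function name wheel_requirements is hypothetical.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as wheel:
        # A wheel is a zip archive; its metadata lives in <name>-<version>.dist-info/METADATA
        metadata_name = next(
            name for name in wheel.namelist()
            if name.endswith(".dist-info/METADATA")
        )
        raw = wheel.read(metadata_name).decode("utf-8")
    # METADATA uses an email-style header format
    metadata = Parser().parsestr(raw)
    return metadata.get_all("Requires-Dist") or []
```

Running it on each downloaded wheel lists what pip will try to resolve when that wheel is installed.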

Can anybody help me? Thanks!

John Rotenstein
elena

3 Answers


It worked for me when I uploaded this Linux build of pyarrow to S3 and added its path to the Python library path box in the edit job modal: https://files.pythonhosted.org/packages/04/57/f9a96302f27f0008f5afcd4232d4df66a6af5e568445128cac52f64ee4fd/pyarrow-3.0.0-cp36-cp36m-manylinux2010_x86_64.whl
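The tags in a wheel's filename say which interpreter and platform it was built for, and the traceback in the question shows the job running Python 3.6. Below is a rough sketch for splitting a wheel filename into its parts and checking the Python tag. Both function names are hypothetical, and the check is deliberately simplified: it ignores the ABI and platform tags.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into name, version, and compatibility tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: " + filename)
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def supports_interpreter(python_tag, interpreter="cp36"):
    """Simplified check: does the Python tag cover the given interpreter?"""
    tags = python_tag.split(".")          # e.g. "py2.py3" -> ["py2", "py3"]
    generic = "py" + interpreter[2]       # "cp36" -> "py3"
    return interpreter in tags or generic in tags
```

For the wheel linked above, the Python tag is cp36 and the platform tag is manylinux2010_x86_64.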

You could use easy_install, as in Mauro's answer, but I believe it will be deprecated sometime soon.

Ravmcgav

To use pyarrow, it is necessary to build your own package. In setup.py, you have to list pyarrow in install_requires.

from setuptools import setup, find_packages

setup(
    name="PACKAGE_NAME",
    version="0.1",
    packages=find_packages(),
    # Dependencies to install together with the package
    install_requires=[
        'pyarrow'
    ],
)

Here you can find more information about building a package for AWS Glue.

According to the documentation, pandas should also be added this way:

pandas (required to be installed via the python setuptools configuration, setup.py)

torm

Update: This is no longer valid; libraries must now be added as .whl files.

Previous answer: Try adding:

import os
import site
import importlib
from setuptools.command import easy_install
install_path = os.environ['GLUE_INSTALLATION']

libraries = ["pyarrow"]
for lib in libraries:
    easy_install.main( ["--install-dir", install_path, lib] )

importlib.reload(site)

at the beginning of the Glue Python Shell code. This seems to work better than adding .whl libs in the Python library path.

Mauro Mascia