5

I'm trying to run a code that uses psycopg2 to manipulate a Redshift instance. I have tried by importing a wheel file as I see they are supported in Glue python jobs. I see the library is installed in the endpoint when running but then I get an error:

import boto3
import psycopg2
Aug 4, 2020, 1:24:06 PM Pending execution
Processing ./glue-python-libs-92ng4pcb/psycopg2-2.8.5-cp36-cp36m-win_amd64.whl
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.5
Considering file without prefix as a python extra file s3://gluelibraries/boto3.zip
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

2020-08-04T13:24:44.831+02:00
Traceback (most recent call last):
  File "/tmp/runscript.py", line 123, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-1t08aq9n/postloading.py", line 6, in <module>
  File "/glue/lib/installation/psycopg2/__init__.py", line 51, in <module>
    from psycopg2._psycopg import (                     # noqa
ModuleNotFoundError: No module named 'psycopg2._psycopg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-1t08aq9n/postloading.py", line 6, in <module>
  File "/glue/lib/installation/psycopg2/__init__.py", line 51, in <module>
    from psycopg2._psycopg import (                     # noqa
ModuleNotFoundError: No module named 'psycopg2._psycopg'

Theoretically Glue jobs in python (contrary to pyspark jobs) should support non pure python libraries

user2728349
  • 139
  • 1
  • 3
  • 12

6 Answers6

7

based on https://stackoverflow.com/a/58305654/4725074

Install psycopg2-binary into a directory and zip up the contents of that directory:

mkdir psycopg2-binary
cd psycopg2-binary
pip install psycopg2-binary -t  .
# in case using python3:
# python3 -m pip install --system psycopg2-binary -t  .
zip -r9 psycopg2.zip *

I then copied psycopg2.zip to an S3 bucket and add it as an extra Python library under "Python library path" in the Glue Spark job.

I then launched the job with the following script to verify if psycopg2 is present (the zip file will be downloaded by Glue into the directory in which the Job script is located)

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import sys
import os
import zipfile

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

zip_ref = zipfile.ZipFile('./psycopg2.zip', 'r')
print os.listdir('.')
zip_ref.extractall('/tmp/packages')
zip_ref.close()
sys.path.insert(0, '/tmp/packages')

import psycopg2
print(psycopg2.__version__)

job.commit()

This worked for me.

Vzzarr
  • 4,600
  • 2
  • 43
  • 80
  • 1
    Thank you ! This should be accepted answer – Sandeep Singh Apr 29 '21 at 20:45
  • 1
    The currently accepted answer (using the --additional-python-modules flag) is good for most scenarios. However, I work in an environment where egress is highly restricted and this answer is the more workable. – Ben Dalling Aug 16 '22 at 08:51
4

Now with Glue Version 2 you can pass in python libraries as parameters to Glue Jobs. I used pyscopg2-binary instead of pyscopg2 and it worked for me. Then in the code I did import psycopg2.

--additional-python-modules

johnhill2424
  • 108
  • 1
  • 4
  • thanks for providing this solution. I encountered the same issue with my glue job. Just add the job parameter --additional-python-modules=psycopg2-binary and problem is solved. I think this solution is better than getting the [psycopg2 library](https://github.com/jkehler/awslambda-psycopg2) from github which did not work for me. Beside glue changes the python version from 3.6 to 3.7.9 already. You don't want to worry when glue version changes and you need to change the library again. – Raymond Apr 26 '21 at 22:55
  • Having the same issue, I've added this parameter but aws glue can't install the module. Error: pg_config executable not found. pg_config is required to build psycopg2 from source. Please add the directory containing pg_config to the $PATH or specify the full executable path with the option: python setup.py build_ext --pg-config /path/to/pg_config build ... or with the pg_config option in 'setup.cfg'. If you prefer to avoid building psycopg2 from source, please install the PyPI 'psycopg2-binary' package instead. – Yousra Jun 30 '21 at 17:17
  • it worked for me but now when I tested it again after few days it's giving error no module found name psycopg2. – Abdul Haseeb Sep 16 '21 at 09:45
  • See this page: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html – Luciano Aug 04 '22 at 17:25
1

I have faced the similar issue with psycopg2 package. It is to do with the compatibility with Python runtime that is accessing the psycopg2 module.

Follow this thread . Hope you'll have your solution. Using psycopg2 with Lambda to Update Redshift (Python)

sumanth shetty
  • 1,851
  • 5
  • 24
  • 57
1

Instead of psycopg2, try using pg8000 which is easy to install and it doesn't have c dependencies. Also, it is used by amazon in most of their internal projects.

Shubham Jain
  • 5,327
  • 2
  • 15
  • 38
0

After having tried with pg8000 with a Python endpoint I got the following error:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 123, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-j7khvbvv/postloading.py", line 7, in <module>
ModuleNotFoundError: No module named 'pg8000'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-j7khvbvv/postloading.py", line 7, in <module>
ModuleNotFoundError: No module named 'pg8000'

when using a pyspark endpoint I don't have this problem with the pg8000

user2728349
  • 139
  • 1
  • 3
  • 12
0

I download wheel from this link with name psycopg2-2.9.1-cp36-cp36m-linux_x86_64.whl and problem was solved. Thanks

Abdul Haseeb
  • 442
  • 4
  • 22