
I am trying to access an RDS instance from AWS Glue. I have a few python scripts running on EC2 instances, and I currently use pyodbc to connect. But while trying to schedule jobs in Glue, I cannot import pyodbc, as it is not natively supported by AWS Glue, and I am not sure how drivers would work in the Glue shell either.

EngineJanwaar

6 Answers

From the Introducing Python Shell Jobs in AWS Glue announcement:

Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others.

The module list doesn't include the pyodbc module, and it cannot be provided as a custom .egg file because it depends on the libodbc.so.2 and pyodbc.so native libraries.

I think you have 2 options:

  1. Create a jdbc connection to your DB from Glue's console, and use Glue's internal methods to query it (see the sketch after this list). This will of course require code changes.
  2. Use Lambda function instead. You'll need to pack pyodbc and the required libs along with your code in a zip file. Someone has already compiled those libs for AWS Lambda, see here.
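
For illustration, a minimal sketch of what option 1 might look like in a Glue Spark job, assuming the jdbc connection is fronted by a catalog database and table; the names my_database and my_table are hypothetical:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read through the catalog entry that fronts the jdbc connection.
# "my_database" and "my_table" are placeholder names.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)

df = dyf.toDF()  # convert to a plain Spark DataFrame for further processing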

Hope it helps

ya24
  • Can you include libraries like psycopg2 as a custom .egg file? I was hoping to use psycopg2 + pandas to interact with the data as opposed to Glue's internal methods. – gibbz00 Jun 13 '19 at 21:15
  • Unfortunately psycopg2 is not preloaded in Glue's python shell. But you can use a jdbc connection + pandas. – ya24 Jun 14 '19 at 22:29
  • So only libraries that are preloaded into the python shell can be used? Can you not bring in any non-preloaded libraries through a custom .egg file? – gibbz00 Jun 17 '19 at 02:54
  • Not with Glue's python shell. But it is possible with Glue's Spark jobs (Python/Scala) under `Security configuration, script libraries, and job parameters` > `Python library path` & `Dependent jars path`. – ya24 Jun 17 '19 at 12:32
  • I posted an answer with some recent findings that might interest you. – gibbz00 Jul 16 '19 at 17:08

For AWS Glue, use either a DataFrame or a DynamicFrame and specify the SQL Server JDBC driver. AWS Glue already contains the JDBC driver for SQL Server in its environment, so you don't need to add any additional driver jar to the glue job.

df1 = (spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", url_src)
    .option("dbtable", dbtable_src)
    .option("user", userID_src)
    .option("password", password_src)
    .load())

If you are using a SQL query instead of a table name:

df1 = (spark.read.format("jdbc")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("url", url_src)
    .option("dbtable", "(your select statement here) A")
    .option("user", userID_src)
    .option("password", password_src)
    .load())

Note that the select statement must be passed as a single string, wrapped in parentheses and given an alias (A here), so that JDBC can treat it as a derived table.

As an alternate solution, you can also use the jTDS driver for SQL Server in your python script running in AWS Glue.
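
A rough sketch of that jTDS variant, assuming the jTDS jar has been supplied to the job (for example via the Dependent jars path) and reusing the variables from the example above; the host, port, and database in the url are placeholders:

# jTDS uses its own driver class and url scheme
df1 = (spark.read.format("jdbc")
    .option("driver", "net.sourceforge.jtds.jdbc.Driver")
    .option("url", "jdbc:jtds:sqlserver://myhost:1433/mydb")  # hypothetical endpoint
    .option("dbtable", dbtable_src)
    .option("user", userID_src)
    .option("password", password_src)
    .load())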

Abraham
  • The question is about Python shell and not Spark. Isn't it? – ya24 May 07 '19 at 17:39
  • @ya2410 To the best of my knowledge AWS Glue does not have a shell or cli interface – Abraham May 07 '19 at 17:39
  • No. I mean the type of the job he is running. When you create a new job in Glue you can choose between Spark and Python shell. Your syntax is for Spark. – ya24 May 07 '19 at 17:46
  • @ya2410 I see an option between Scala and Python. No option for python shell. – Abraham May 07 '19 at 17:58
  • I stand corrected. I see what you are talking about. I believe the syntax should work, as the glue job runs on Spark on EMR. But I must admit I haven't used it on python shell myself. – Abraham May 07 '19 at 18:02
  • Yes, I was looking to do this in Python shell since I would have preferred an easy pyodbc connection. However, this creates a dataframe; can you help me with an example of creating a dataframe by using a select query on a table in that DB? – EngineJanwaar May 16 '19 at 11:06
  • @RakeshGuha Updated the answer for the "select" statement. There is alternate syntax as well, which is explained in the Spark documentation for dataframes. – Abraham May 16 '19 at 14:15

If anyone needs a postgres connection with sqlalchemy using the python shell, it is possible by referencing the sqlalchemy, scramp, and pg8000 wheel files. It is important to rebuild the pg8000 wheel after removing the scramp dependency from its setup.py, since scramp is supplied as its own wheel.
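
A minimal sketch of that setup, assuming the three wheel files are already listed in the job's Python library path (the endpoint and credentials below are placeholders):

from sqlalchemy import create_engine, text

# pg8000 is pure python, which is why it can run in the Glue python shell
engine = create_engine(
    "postgresql+pg8000://user:password@myhost.example.com:5432/mydb"  # hypothetical endpoint
)

with engine.connect() as conn:
    for row in conn.execute(text("select version()")):
        print(row)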

ignacio
juanjbon

I needed to do something similar and ended up creating another Glue job in Scala, while using Python for everything else. I know it may not work for everyone, but I wanted to mention it: How to run DDL SQL statement using AWS Glue

mishkin

I was able to use the python library psycopg2 even though it is not written in pure python and does not come preloaded with the aws glue python shell environment, which runs contrary to the aws glue documentation. So you might be able to use odbc-related python libraries in a similar way. I created .egg files for the psycopg2 library and used them successfully within the glue python shell environment. The following are the logs from the glue python shell when your script has import psycopg2 and the glue job references the related psycopg2 .egg files:

Creating /glue/lib/installation/site.py
Processing psycopg2-2.8.3-py2.7.egg
Copying psycopg2-2.8.3-py2.7.egg to /glue/lib/installation
Adding psycopg2 2.8.3 to easy-install.pth file
Installed /glue/lib/installation/psycopg2-2.8.3-py2.7.egg
Processing dependencies for psycopg2==2.8.3
Searching for psycopg2==2.8.3
Reading https://pypi.org/simple/psycopg2/
Downloading https://files.pythonhosted.org/packages/5c/1c/6997288da181277a0c29bc39a5f9143ff20b8c99f2a7d059cfb55163e165/psycopg2-2.8.3.tar.gz#sha256=897a6e838319b4bf648a574afb6cabcb17d0488f8c7195100d48d872419f4457
Best match: psycopg2 2.8.3
Processing psycopg2-2.8.3.tar.gz
Writing /tmp/easy_install-dml23ld7/psycopg2-2.8.3/setup.cfg
Running psycopg2-2.8.3/setup.py -q bdist_egg --dist-dir /tmp/easy_install-dml23ld7/psycopg2-2.8.3/egg-dist-tmp-9qwen3l_
creating /glue/lib/installation/psycopg2-2.8.3-py3.6-linux-x86_64.egg
Extracting psycopg2-2.8.3-py3.6-linux-x86_64.egg to /glue/lib/installation
Removing psycopg2 2.8.3 from easy-install.pth file
Adding psycopg2 2.8.3 to easy-install.pth file
Installed /glue/lib/installation/psycopg2-2.8.3-py3.6-linux-x86_64.egg
Finished processing dependencies for psycopg2==2.8.3
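
For completeness, a minimal sketch of what the script body itself might look like once the .egg import succeeds (the endpoint and credentials are placeholders):

import psycopg2

# hypothetical RDS endpoint and credentials
conn = psycopg2.connect(
    host="mydb.abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="mydb",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    cur.execute("select version()")
    print(cur.fetchone())

conn.close()
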
gibbz00
  • Where do you see these logs? I checked the error logs and regular logs in CloudWatch and didn't see anything similar. I am using an egg package like you suggest. "Continuous logging" is enabled and "Log filtering" has "no filter". – chief7 Aug 12 '19 at 19:55
  • Hi, how did you create the egg file? What files did you use from the original psycopg2 github repo? – kyle chan Nov 20 '20 at 22:04
  • I tried to do the same thing but always got `No module named 'psycopg2._psycopg'`, and I am not seeing it in the github. – kyle chan Nov 20 '20 at 22:29

These are the steps that I used to connect to an RDS instance from a glue python shell job:

  1. Package up your dependency into an egg file (these packages must be pure python, if I remember correctly). Put it in S3.
  2. Set your job to reference that egg file under job configuration > Python library path.
  3. Verify that your job can import the package/module.
  4. Create a glue connection to your RDS (it's in the Glue console alongside Databases and Tables, under Connections), and test the connection to make sure it can reach your RDS.
  5. Now in your job, you must set it to reference/use this connection. It's under the required connections as you configure or edit your job.

Once those steps are done and verified, you should be able to connect. In my sample I used pymysql; a minimal sketch follows.
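
As an illustration, a minimal pymysql sketch along the lines of step 5, assuming the egg import works and the Glue connection is attached to the job (the endpoint and credentials are placeholders):

import pymysql

# hypothetical RDS MySQL endpoint, reachable because the job uses the Glue connection
conn = pymysql.connect(
    host="mydb.abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="secret",
    database="mydb",
)

with conn.cursor() as cur:
    cur.execute("select 1")
    print(cur.fetchone())

conn.close()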

noobius