1

I am trying to run my custom Python code, which requires libraries that are not supported by AWS Glue (pandas). So, I created a zip file with the necessary libraries and uploaded it to an S3 bucket. While running the job, I pointed to the S3 path in the advanced properties, but the job still does not run successfully. Can anyone suggest why?

1. Do I have to include my code in the zip file? If yes, how will Glue know which file is the code?
2. Do I need to create a package, or will a plain zip file do?

Appreciate the help!

Emma
  • 353
  • 3
  • 7
  • 16
  • How are they not supported? Why can't you install them with pip? – Fallenreaper Jul 06 '18 at 04:37
  • I read this in the AWS Glue docs: 'Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.' Can you please elaborate on how I would do pip with AWS Glue? – Emma Jul 06 '18 at 04:47

3 Answers

2

According to AWS Glue Documentation:

Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

I don't think it would work even if you upload the Python library as a zip file, if the library you are using depends on C extensions. I tried pandas, holidays, and others the same way you have, and on contacting AWS Support they mentioned that support for these Python libraries is on their to-do list, but there is no ETA as of now.

So, at this point, any library that is not pure Python will not work in AWS Glue. That should change in the near future, since this is a popular request.

If you would still like to try it, please refer to this link, which explains how to package external libraries to run in AWS Glue. I tried it, but it didn't work for me.
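For what it's worth, the general mechanism for pure-Python dependencies is to zip the packages, upload the zip to S3, and point the job's "Python library path" (the --extra-py-files argument) at it. Below is a minimal boto3 sketch of wiring that up when creating a job; the bucket, key, role, and script location are hypothetical placeholders, not values from this question.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical S3 paths and role name -- replace with your own.
glue.create_job(
    Name="my-etl-job",
    Role="MyGlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_job.py",
    },
    DefaultArguments={
        # Zip of *pure-Python* packages; Glue adds it to the Python path at run time.
        "--extra-py-files": "s3://my-bucket/libs/my_libs.zip",
    },
)
```

Note that this only helps for pure-Python packages; a zip of pandas would still fail to import because of its compiled C extensions.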

Yuva
  • 2,831
  • 7
  • 36
  • 60
2

An update on AWS Glue jobs, released on 22nd Jan 2019:

Introducing Python Shell Jobs in AWS Glue -- Posted On: Jan 22, 2019

Python shell jobs in AWS Glue support scripts that are compatible with Python 2.7 and come pre-loaded with libraries such as the Boto3, NumPy, SciPy, pandas, and others. You can run Python shell jobs using 1 DPU (Data Processing Unit) or 0.0625 DPU (which is 1/16 DPU). A single DPU provides processing capacity that consists of 4 vCPUs of compute and 16 GB of memory.
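Since pandas comes pre-installed in Python shell jobs, a job script can use it directly. Here is a minimal sketch of such a script, reading a CSV from a hypothetical S3 location (bucket and key are placeholders):

```python
import io

import boto3
import pandas as pd  # pre-installed in Glue Python shell jobs

# Hypothetical bucket/key -- replace with your own object.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="input/data.csv")

# Load the CSV into a DataFrame and print a quick summary.
df = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(df.describe())
```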

More info at: https://aws.amazon.com/about-aws/whats-new/2019/01/introducing-python-shell-jobs-in-aws-glue/

https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html

Yuva
  • 2,831
  • 7
  • 36
  • 60
  • Do you understand the difference between those two types of jobs that AWS Glue provides? (Spark vs. Python shell) – Hasan Jawad Feb 06 '19 at 13:07
  • 1
    Yeah, one significant difference is that a Python shell job runs on a single DPU instance; you cannot run it on multiple DPUs. Another difference is that Python shell jobs do not seem to accept zip files for third-party Python libraries; we need to create an egg file and upload it to S3 instead (see the sketch after this comment thread). I haven't had much time to check the performance of Python shell jobs, though. – Yuva Feb 07 '19 at 02:20
  • So in theory, you can do fully fledged ETL jobs in a Python shell? The same job complexity as in Spark jobs, but on smaller datasets! – Hasan Jawad Feb 07 '19 at 13:52
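To illustrate the egg approach mentioned in the comments above, here is a minimal sketch; the package name mypkg and the S3 destination are hypothetical. The resulting egg is then referenced through the job's --extra-py-files / "Python library path" setting.

```python
# setup.py -- for a hypothetical pure-Python package laid out as:
#   mypkg/__init__.py
#   mypkg/helpers.py
from setuptools import setup, find_packages

setup(
    name="mypkg",
    version="0.1",
    packages=find_packages(),
)

# Build the egg and upload it (run from a shell):
#   python setup.py bdist_egg
#   aws s3 cp dist/mypkg-0.1-py2.7.egg s3://my-bucket/libs/
```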
0

As Yuva's answer mentioned, I believe it is currently impossible to import a library that is not pure Python, and the documentation reflects that.

However, in case someone comes here looking for how to import a Python library into AWS Glue in general, there is a good explanation of how to do it with the pg8000 library in this post: AWS Glue - Truncate destination postgres table prior to insert
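As background, pg8000 works for this because it is pure Python, so it can be zipped, uploaded to S3, and referenced via --extra-py-files like any other pure-Python package. A minimal usage sketch, with hypothetical connection details (host, database, credentials, and table name are placeholders):

```python
import pg8000

# Hypothetical connection parameters -- replace with your own endpoint.
conn = pg8000.connect(
    host="mydb.example.us-east-1.rds.amazonaws.com",
    port=5432,
    database="mydb",
    user="etl_user",
    password="secret",
)

cur = conn.cursor()
# Truncate the destination table before the job inserts fresh data.
cur.execute("TRUNCATE TABLE public.destination_table")
conn.commit()
conn.close()
```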

Nikolay D
  • 329
  • 3
  • 11