
I'm trying to run unit tests on my PySpark scripts locally so that I can integrate them into our CI.

$ pyspark
...
>>> import pandas as pd
>>> df = pd.DataFrame([(1,2,3), (4,5,6)])
>>> df
   0  1  2
0  1  2  3
1  4  5  6

As per the documentation, I should be able to convert using the following:

from awsglue.dynamicframe import DynamicFrame
dynamic_frame = DynamicFrame.fromDF(dataframe, glue_ctx, name)

But when I try to convert to a DynamicFrame, I get errors when trying to instantiate the GlueContext:

$ pyspark
>>> from awsglue.context import GlueContext
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/awsglue/context.py", line 43, in __init__
    self._glue_scala_context = self._get_glue_scala_context(**options)
  File "/Library/Python/2.7/site-packages/awsglue/context.py", line 63, in _get_glue_scala_context
    return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable

How do I get this working WITHOUT using AWS Glue Dev Endpoints? I don't want to be charged EVERY TIME I commit my code. That's absurd.

JonTroncoso

2 Answers


Why do you want to convert from a DataFrame to a DynamicFrame at all? You can't unit test against the Glue APIs anyway, since there are no mocks for them.

I prefer the following approach:

  1. Write two files per Glue job: job_glue.py and job_pyspark.py
  2. Write the Glue-API-specific code in job_glue.py
  3. Write the non-Glue, PySpark-specific code in job_pyspark.py
  4. Write pytest test cases for job_pyspark.py (see the sketch below)
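A minimal sketch of that layout, assuming a hypothetical transformation (add_total, the column names, and the fixture are illustrative, not from the original answer):

# job_pyspark.py -- pure PySpark logic, no Glue imports, runs anywhere
from pyspark.sql import functions as F

def add_total(df):
    # Illustrative transformation: derive a `total` column from `a` + `b`
    return df.withColumn("total", F.col("a") + F.col("b"))

# test_job_pyspark.py -- pytest against a plain local SparkSession
import pytest
from pyspark.sql import SparkSession
from job_pyspark import add_total

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.master("local[*]").appName("tests").getOrCreate()

def test_add_total(spark):
    df = spark.createDataFrame([(1, 2), (4, 5)], ["a", "b"])
    result = add_total(df)
    assert [row.total for row in result.collect()] == [3, 9]

job_glue.py then stays a thin wrapper that converts the incoming DynamicFrame with toDF(), calls add_total, and converts back with fromDF(); it is only exercised on Glue itself, never in CI.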
Sandeep Fatangare
  • Honestly, I'm as new to Python as I am to Glue, so I don't know which is which. I ended up creating an anonymous object (`type('', (object,), value)`) and just throwing that into the map function referenced by my PySpark script. That's the only thing I tested, but it seems like the only thing I can test. Absolutely stupid that they expect people to pay for a dev environment without providing ways of mocking their SDK. – JonTroncoso Dec 28 '18 at 05:02
  • Anything you are doing with a DataFrame is PySpark; anything you are doing with a DynamicFrame is Glue. – Sandeep Fatangare Dec 28 '18 at 05:42
  • That actually adds a lot of clarity. AWS Glue created a template for me that included just about everything for taking data from files A to database B, so I just added the one line mapping rows through my mapping function. I'm not sure why the default is DynamicFrame. – JonTroncoso Dec 29 '18 at 02:39
  • DynamicFrame has a few advantages over DataFrame. Reference: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html Sadly, Glue has very limited APIs which work directly on DynamicFrames; in most scenarios a DynamicFrame has to be converted to a DataFrame to use the PySpark APIs. I hope Glue will provide more API support in the future, in turn reducing unnecessary conversions to DataFrames. – Sandeep Fatangare Dec 29 '18 at 18:46
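For reference, the conversion round-trip described in the last comment looks like this; dynamic_frame and glueContext are assumed to already exist, so this only runs inside an actual Glue job (locally it fails exactly as shown in the question):

from awsglue.dynamicframe import DynamicFrame

# DynamicFrame -> DataFrame, to get the full pyspark API
df = dynamic_frame.toDF()

# illustrative DataFrame-only operation
df = df.filter(df["age"] > 21)

# DataFrame -> DynamicFrame, so Glue writers can consume it again
dynamic_frame = DynamicFrame.fromDF(df, glueContext, "filtered")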

I think at present there is no alternative for us other than using Glue. For reference: Can I test AWS Glue code locally?

TEJASWAKUMAR