I have an ETL job in Glue that processes a very large (300M row) JDBC database table, but I really only need a subset (certain ids) of this table. When I do

glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")

does this load the entire table right at this command? Is there a way to write a custom query so that only the data I need is loaded? Or, if I follow this with another command, say a Filter transform or a Spark SQL command on the DataFrame, will that filter as the data is pulled?
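For concreteness, this is roughly what I mean by the follow-up commands (just a sketch; the `id` column and `needed_ids` are placeholders for the subset I'm after):

from awsglue.transforms import Filter

dyf = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")

# Option 1: Glue's Filter transform on the DynamicFrame
filtered_dyf = Filter.apply(frame=dyf, f=lambda row: row["id"] in needed_ids)

# Option 2: Spark SQL on the underlying DataFrame
spark = glueContext.spark_session
dyf.toDF().createOrReplaceTempView("persons")
filtered_df = spark.sql("SELECT * FROM persons WHERE id IN (1, 2, 3)")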

Gandalf
1 Answer
Well, when you run:
glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")
It only creates a reference (a Glue DynamicFrame, which wraps a Spark DataFrame); no data is read at this point.
Spark works with transformations (e.g. filter, map, select) and actions (e.g. collect, count, show). You can read more about it in How Apache Spark’s Transformations And Action works, but basically, your database table will only be loaded into memory when an action is called. This is one of the many reasons Spark is so powerful and recommended for working with datasets of any size.
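For example, here is a tiny self-contained sketch of that lazy behavior in plain PySpark (toy data, just for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# Transformations: these only build up an execution plan, nothing runs yet
subset = df.filter(df.id > 1).select("name")

# Action: only now does Spark execute the plan and materialize the data
subset.show()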
This PDF shows all the transformations and actions available, along with some samples of using them.
So yes, you need to do a few steps first, like:
# from_catalog returns a DynamicFrame; convert it to a Spark DataFrame first
dyf = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons")
df = dyf.toDF().filter(YOUR_FILTER).select(SPECIFIC_COLS)
# Calling an action to show the filtered DF
df.show()
This way you only load the specific columns and rows you need into memory.
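If you want to see what actually gets sent to the JDBC source, one (illustrative) check is to print the physical plan before calling any action; simple predicates on a JDBC-backed DataFrame usually show up under PushedFilters:

# Inspect the physical plan (explain itself reads no data)
df.explain()
# In the output, predicates listed under "PushedFilters: [...]" are pushed down
# to the JDBC source; anything not listed there is applied on the Spark side
# after the rows have been fetched.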

Kafels
- I do wonder though, will it actually edit the SQL statement and thus pull less data? Or simply pull the entire table and filter it per the transform? We are trying to limit the amount of data over the wire due to network limitations. – Gandalf Jul 11 '19 at 22:38
- It depends on how your code was written. I found this [answer](https://stackoverflow.com/questions/49595205/how-does-apache-spark-works-in-memory) that explains very well how to avoid loading the whole database into memory. – Kafels Jul 12 '19 at 15:06
- From that answer it appears the `.filter()` will indeed pull all the rows and filter them on the Spark side, not at the DB level - so it won't actually save any space over the wire. Looks like I need to do it at the query level somehow. Perhaps using `spark.sql(...)` – Gandalf Jul 12 '19 at 20:39
- Actually, doing a query with `filter()` or directly in `.sql()` can lead to the same process. In both cases Spark will create an execution plan and optimize it before executing. To see how Spark will execute your functions and queries, you should call the `.explain()` method – Kafels Jul 13 '19 at 01:01
- I mean, writing it in `.sql()` does not guarantee that Spark will execute your query directly in the database – Kafels Jul 13 '19 at 01:41
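If the goal is really to limit what goes over the wire, one option (a sketch, not Glue-specific; the JDBC URL, credentials and column names are placeholders) is to bypass the catalog reader and hand Spark's JDBC source the subquery directly, so the database itself only returns the rows you need:

# Plain Spark JDBC read with the subset pushed into the query itself
subset_df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://your-host:5432/legislators")  # placeholder URL
    .option("dbtable", "(SELECT id, name FROM persons WHERE id IN (1, 2, 3)) AS persons_subset")
    .option("user", "your_user")          # placeholder credentials
    .option("password", "your_password")
    .load())

subset_df.show()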