I am doing a join from S3 parquet data to a JDBC (Postgres) table, matching a column in the parquet data against the primary key of the JDBC table. I only need a small fraction of the JDBC table (though still a large absolute number: tens or hundreds of thousands of rows), and then I need to partition that data intelligently for use in the executors.
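For context, a stripped-down version of what I'm doing looks roughly like this (PySpark; the bucket, connection URL, table, and column names are placeholders for my real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parquet side: large dataset in S3, with a UUID column referencing the Postgres PK
parquet_df = spark.read.parquet("s3://my-bucket/my-data/")

# JDBC side: as written, this pulls the *whole* table unless I restrict it somehow
jdbc_df = spark.read.jdbc(
    url="jdbc:postgresql://db-host:5432/mydb",
    table="t1",
    properties={"user": "user", "password": "password",
                "driver": "org.postgresql.Driver"},
)

joined = parquet_df.join(jdbc_df, parquet_df["t1_id"] == jdbc_df["id"])
```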
I'm still new to data engineering as a whole and Spark in particular, so pardon (and assume!) my ignorance. I'm less concerned with processing time than with memory usage; I have to fit within the AWS Glue memory limits.
What is a good way to do this?
My existing thoughts:
I could, in theory, construct a SQL query like:
```sql
SELECT * FROM t1 WHERE id = key1 UNION
SELECT * FROM t1 WHERE id = key2 UNION ...
```
But this seems silly. This question, "Selecting multiple rows by ID, is there a faster way than WHERE IN?", gives me the idea of writing the keys I want to pull to a temporary table, joining that with the original table, and pulling the result; that seems like the "correct" way to do the above. But it also seems like a common enough problem that there should be a ready-made solution I just haven't found yet.
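A rough sketch of that temporary-table idea, reusing spark and parquet_df from above (the staging_keys table name and connection details are placeholders, and I'd presumably need to clean the staging table up afterwards):

```python
jdbc_url = "jdbc:postgresql://db-host:5432/mydb"
props = {"user": "user", "password": "password",
         "driver": "org.postgresql.Driver"}

# Distinct join keys from the parquet side
keys = parquet_df.select("t1_id").distinct()

# Stage the keys in Postgres
keys.write.jdbc(jdbc_url, table="staging_keys", mode="overwrite", properties=props)

# Read back only the matching rows by pushing the join down to Postgres
subquery = "(SELECT t1.* FROM t1 JOIN staging_keys k ON t1.id = k.t1_id) AS matched"
matched = spark.read.jdbc(jdbc_url, table=subquery, properties=props)
```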
There's also the possibility of pulling everything between the min and max UUID values of the keys I need, but then it's a question of how many extra rows I'd be pulling; since the UUIDs are, AFAIK, randomly distributed across the possible UUID values, I expect that to pull a lot of extra rows (rows that will be dropped during the join). Still, this might be a useful way to partition the JDBC data.
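If I do go the range route, it looks like the predicates argument to spark.read.jdbc would let me turn UUID ranges into one query per partition; something like the following, where the connection details and column name are placeholders, and I'm assuming Postgres will cast the string literals to uuid:

```python
# One WHERE clause per partition; each executor task runs its own query.
# Bucketing on the first hex digit of the UUID should give ~16 even partitions,
# since the UUIDs are roughly uniformly distributed.
hex_digits = "0123456789abcdef"
predicates = []
for i, h in enumerate(hex_digits):
    lower = f"id >= '{h}0000000-0000-0000-0000-000000000000'"
    if i + 1 < len(hex_digits):
        upper = f"id < '{hex_digits[i + 1]}0000000-0000-0000-0000-000000000000'"
        predicates.append(f"{lower} AND {upper}")
    else:
        predicates.append(lower)  # last bucket is open-ended at the top

partitioned = spark.read.jdbc(jdbc_url, "t1",
                              predicates=predicates, properties=props)
```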
It's also still unclear to me how the JDBC data gets to the executors; I worry that it might pass (in its entirety) through the driver process.
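One thing I can at least check empirically is how many partitions a plain read gives me; my understanding (which I'd like confirmed or corrected) is that with no partitioning options the whole table comes through a single task:

```python
plain = spark.read.jdbc(jdbc_url, "t1", properties=props)
print(plain.rdd.getNumPartitions())  # 1 for a non-partitioned JDBC read, I believe
```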
So, to try to formalize this into questions:
- Is there an existing recipe for this use case?
- What Spark features should I be looking at to accomplish this?
- What is the actual Spark data flow for data coming from a JDBC connection?