SparkSQL join data with Rest API data

Question

I need to join a Hive table with JSON data from a REST endpoint. Is it better to use a UDF or a data source (like a temp table)? If using a UDF, what'd be a good way to throttle RPS?

score 1 · Accepted Answer · answered Oct 20 '21 at 17:18

If you need to look up data from the REST endpoint inside Spark, you likely want to look at mapPartitions. There's a good explanation of why it can be better than just using map (and a UDF), and by implication it also speaks to throttling. Each partition is processed by one task on an executor at a time, so the number of partitions gives you a theoretical max on concurrent calls to the endpoint. (I say theoretical max as you aren't always guaranteed to get all the executors you wish for.)
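To make the throttling idea concrete, here is a minimal sketch of a per-partition function that spaces out lookups to stay under a requests-per-second budget. The names `make_partition_enricher`, `lookup`, and `max_rps` are hypothetical and not part of Spark; `lookup` stands in for whatever call you make to your REST endpoint. The function is plain Python, so it can be tried locally before passing it to `rdd.mapPartitions`.

```python
import time
from typing import Any, Callable, Iterable, Iterator, Tuple


def make_partition_enricher(
    lookup: Callable[[Any], Any],
    max_rps: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Iterable[Any]], Iterator[Tuple[Any, Any]]]:
    """Build a function that can be passed to rdd.mapPartitions.

    `lookup` is a placeholder for the REST call (an assumption, not a Spark API).
    `max_rps` is the request budget for ONE partition, not for the whole job.
    """
    min_interval = 1.0 / max_rps

    def enrich(rows: Iterable[Any]) -> Iterator[Tuple[Any, Any]]:
        # Per-partition setup (e.g. opening one HTTP session) would go here,
        # which is the usual reason to prefer mapPartitions over map.
        next_allowed = clock()
        for row in rows:
            now = clock()
            if now < next_allowed:
                # Wait until the next request slot opens.
                sleep(next_allowed - now)
            next_allowed = max(now, next_allowed) + min_interval
            yield row, lookup(row)

    return enrich


# Hypothetical usage on a cluster (requires pyspark; names are made up):
# df = spark.table("my_hive_table").repartition(8)
# enriched = df.rdd.mapPartitions(make_partition_enricher(call_rest_api, max_rps=5))
```

With 8 partitions at 5 requests per second each, the job would send at most roughly 40 requests per second in total, and fewer if not all partitions run at the same time.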
