I need to join a Hive table with JSON data from a REST endpoint. Is it better to use a UDF or a data source (like a temp table)? If using a UDF, what would be a good way to throttle requests per second (RPS)?
1 Answer
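If you need to look up data from the REST endpoint inside Spark, you likely want to look at `mapPartitions`. It can be a better fit than just using map (and a UDF) because per-partition setup, such as creating an HTTP client, runs once per partition instead of once per row. It also speaks to throttling by implication. Each partition is processed by one task at a time, so the number of concurrent callers is at most the number of partitions, and in practice no more than the number of task slots you have (executors × cores per executor). So you can set a theoretical max using this. (I say theoretical max as you aren't always guaranteed to get all the executors you wish for.)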
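To make the throttling idea concrete, here is a minimal sketch of a per-partition lookup with a simple rate limit. It is not tied to any particular endpoint: `make_partition_lookup`, `fetch`, and `max_rps_per_partition` are names I made up for illustration, and `fetch` stands in for whatever function performs your actual REST call. The clock and sleep functions are parameters only so the pacing can be checked without waiting.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def make_partition_lookup(
    fetch: Callable[[str], dict],
    max_rps_per_partition: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[Iterable[str]], Iterator[Tuple[str, dict]]]:
    """Build a function that can be passed to rdd.mapPartitions.

    fetch is a hypothetical callable that performs one REST call per key.
    """
    min_interval = 1.0 / max_rps_per_partition

    def lookup_partition(keys: Iterable[str]) -> Iterator[Tuple[str, dict]]:
        # Per-partition setup (for example, opening an HTTP session) would
        # go here, so it runs once per partition instead of once per row.
        next_allowed = clock()
        for key in keys:
            # Wait until the next request slot for this partition opens.
            wait = next_allowed - clock()
            if wait > 0:
                sleep(wait)
            next_allowed = max(clock(), next_allowed) + min_interval
            yield key, fetch(key)

    return lookup_partition


# Hypothetical Spark usage (assumes an existing DataFrame hive_df with an
# "id" column and a user-supplied fetch_json function; not executed here):
#
#   keys = hive_df.select("id").rdd.map(lambda row: row["id"])
#   looked_up = keys.repartition(8).mapPartitions(
#       make_partition_lookup(fetch_json, max_rps_per_partition=5.0)
#   )
```

With 8 partitions and 5 requests per second per partition, the aggregate ceiling is about 40 requests per second, and lower if fewer than 8 tasks actually run at once. The result can then be turned back into a DataFrame and joined with the Hive table on the key.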

Matt Andruff