I need to join a Hive table with JSON data from a REST endpoint. Is it better to use a UDF or a data source (like a temp table)? If using a UDF, what would be a good way to throttle requests per second (RPS)?

dwong

1 Answer

If you need to look up data from the REST endpoint inside Spark, you likely want to look at mapPartitions. Here's a good explanation of why it can be better to use that than just using map (and a UDF). It also speaks to throttling by implication: each partition is processed by a single task on an executor, so the number of partitions gives you a theoretical max on concurrent requests. (I say theoretical max as you aren't always guaranteed to get all the executors you wish for.)
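To make the idea concrete, here is a minimal sketch of a per-partition lookup with a simple rate limit. It is an illustration under assumptions, not a definitive implementation: `make_partition_lookup`, `fetch`, and `max_rps` are hypothetical names, and `fetch` stands in for whatever call you make to the REST endpoint. The throttle is plain Python, so it caps the request rate of one partition only.

```python
import time


def make_partition_lookup(fetch, max_rps, clock=time.monotonic, sleep=time.sleep):
    """Build a function to pass to mapPartitions.

    fetch   -- hypothetical callable doing one REST lookup for one row
    max_rps -- assumed per-partition cap on requests per second
    """
    min_interval = 1.0 / max_rps

    def lookup_partition(rows):
        # Per-partition setup (for example an HTTP session) would go here,
        # so it runs once per partition instead of once per row.
        next_allowed = clock()
        for row in rows:
            wait = next_allowed - clock()
            if wait > 0:
                sleep(wait)  # space requests at least min_interval apart
            next_allowed = max(clock(), next_allowed) + min_interval
            yield row, fetch(row)

    return lookup_partition


# Hypothetical usage with PySpark (not executed here):
#   lookup = make_partition_lookup(fetch=call_rest_endpoint, max_rps=5)
#   enriched = df.repartition(8).rdd.mapPartitions(lookup)
```

With this shape, the overall request rate is bounded by roughly the number of partitions running at the same time multiplied by `max_rps`, and the number running at once depends on the executors and cores you actually get.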

Matt Andruff