I have a compute-heavy Python function that I wrap into a PySpark UDF and run on about 100 rows of data. Looking at CPU utilization, it seems that some worker nodes are not utilized at all. I realize this could have many causes, and I am trying to debug it.
Inside the UDF, I already log various statistics (e.g. the start and finish time of each UDF execution). Is there any way to also log the ID of the worker node executing the call? The intention is to verify that the work is evenly distributed across all worker nodes.
The worker's IP address, or any other unique identifier I can log from inside the UDF, would work just as well.
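
For reference, here is a stripped-down sketch of what my setup roughly looks like (the names `expensive_work` and `timed_computation`, and the timing/print statements, are placeholders rather than my actual code):

```python
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType


def expensive_work(x):
    # stand-in for the real compute-heavy function
    time.sleep(1.0)
    return float(x) * 2.0


def timed_computation(x):
    start = time.time()
    result = expensive_work(x)
    finish = time.time()
    # statistics I currently log for each UDF execution (ends up in the executor logs)
    print(f"start={start:.3f} finish={finish:.3f} duration={finish - start:.2f}s")
    # <-- here I would also like to log the worker node ID / hostname / IP
    return result


timed_udf = udf(timed_computation, DoubleType())

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")
df.withColumn("result", timed_udf(df["value"])).collect()
```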