
I have a compute-heavy Python function that I wrap into a PySpark UDF and run on about 100 rows of data. Looking at CPU utilization, it seems that some worker nodes are not being utilized at all. I realize that this could have a multitude of reasons, and I am trying to debug it.

Inside the UDF, I am already logging various statistics (e.g. the start and finish time of each UDF execution). Is there any way to log the worker node ID as well? The intention is to make sure that the jobs are evenly distributed across all worker nodes.

I guess the IP of the worker, or any other unique identifier that I can log inside the UDF, would work as well.
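For illustration, the setup looks roughly like this (simplified; heavy_func and the exact log format are just placeholders), with the printed lines ending up in each executor's stderr log:

import time

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def heavy_func(x):
    start = time.time()
    result = float(x) ** 2  # stand-in for the compute-heavy work
    # goes to the executor's stderr log, viewable per node in the Spark UI
    print(f"row={x} start={start:.3f} duration={time.time() - start:.3f}s")
    return result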

Thomas
  • Not the worker ID, but how about its network address or its hostname? You can cross-reference them in the Spark UI – pltc Sep 23 '21 at 19:00
  • 1
    Do you know how I would do that? I cannot find a single thread on stackoverflow or elsewhere where it describes how to read that info from inside the executor process... – Thomas Sep 24 '21 at 09:29
  • 1
    In Azure Databricks, I adopted this [answer](https://stackoverflow.com/a/166589/13106037) to get the IP of the worker node in an executing UDF. – fskj Sep 24 '21 at 11:39
  • 1
    @Thomas you can try this https://www.delftstack.com/howto/python/get-ip-address-python/ – pltc Sep 24 '21 at 15:26
  • Thanks! One of them worked; I will add an answer for future users. – Thomas Sep 27 '21 at 15:47
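For reference, the approach from fskj's comment, used inside the UDF, would look roughly like this (a sketch; the helper name is hypothetical, and the target address is never actually contacted):

import socket

def get_worker_ip():
    # Open a UDP socket towards an arbitrary external address; nothing is
    # actually sent, but the OS selects the outgoing interface, whose
    # address can then be read back.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()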

1 Answer


The following works:

import socket

def my_udf_func(params):
    # ... your computation here ...
    host = socket.gethostname()  # hostname of the executor running this call

You can then either return host as part of the return value (e.g. in a dict) or write it to your logs. The hostname provided by Databricks is the cluster name plus the IP address of the worker node, for example:

0927-152944-dorky406-10-20-136-4

10-20-136-4 in this case corresponds to the IP address 10.20.136.4.
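For example, returning the hostname in a struct makes it easy to see afterwards how the rows were distributed. A minimal sketch (the schema, column names and dummy computation are just placeholders):

import socket
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# hypothetical schema: the actual result plus the worker hostname and runtime
result_schema = StructType([
    StructField("result", DoubleType()),
    StructField("host", StringType()),
    StructField("duration_s", DoubleType()),
])

@udf(returnType=result_schema)
def my_udf_func(x):
    start = time.time()
    result = float(x) ** 2  # stand-in for the compute-heavy work
    return {
        "result": result,
        "host": socket.gethostname(),       # worker hostname
        "duration_s": time.time() - start,  # per-row runtime
    }

df = spark.range(100).withColumn("out", my_udf_func(col("id")))
df.groupBy("out.host").count().show()  # rows processed per worker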

socket.getsockname() seems to be inconsistent - I would not recommend using it.

Thomas