
I have a compute-heavy Python function that I wrap into a PySpark UDF and run on about 100 rows of data. Looking at CPU utilization, it seems that some worker nodes are not being utilized at all. I realize that this could have a multitude of reasons, and I am trying to debug it.

Inside the UDF, I am already logging various statistics (e.g. the start and finish time of each UDF execution). Is there any way to log the worker node ID as well? The intention is to make sure that the jobs are evenly distributed across all worker nodes.

I guess the IP of the worker, or any other unique identifier that I can log inside the UDF, would work as well.
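For illustration, the setup looks roughly like this (simplified; heavy_func and the exact log format are just placeholders), with the printed lines ending up in each executor's stderr log:

import time

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(returnType=DoubleType())
def heavy_func(x):
    start = time.time()
    result = float(x) ** 2  # stand-in for the compute-heavy work
    # goes to the executor's stderr log, viewable per node in the Spark UI
    print(f"row={x} start={start:.3f} duration={time.time() - start:.3f}s")
    return result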

Thomas
  • Not the worker ID, but how about its network address or its hostname? You can cross-reference them in the Spark UI – pltc Sep 23 '21 at 19:00
  • 1
    Do you know how I would do that? I cannot find a single thread on stackoverflow or elsewhere where it describes how to read that info from inside the executor process... – Thomas Sep 24 '21 at 09:29
  • 1
    In Azure Databricks, I adopted this [answer](https://stackoverflow.com/a/166589/13106037) to get the IP of the worker node in an executing UDF. – fskj Sep 24 '21 at 11:39
  • 1
    @Thomas you can try this https://www.delftstack.com/howto/python/get-ip-address-python/ – pltc Sep 24 '21 at 15:26
  • Thanks! One of them worked; I will add an answer for future users. – Thomas Sep 27 '21 at 15:47
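For reference, the approach from fskj's comment, used inside the UDF, would look roughly like this (a sketch; the helper name is hypothetical, and the target address is never actually contacted):

import socket

def get_worker_ip():
    # Open a UDP socket towards an arbitrary external address; nothing is
    # actually sent, but the OS selects the outgoing interface, whose
    # address can then be read back.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    finally:
        s.close()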

1 Answer


The following works:

import socket

def my_udf_func(params):
    # ... your computation here ...
    host = socket.gethostname()  # hostname of the executor running this call

You can then either return host as part of the return value (e.g. in a dict) or write it to your logs. The hostname provided by Databricks is the cluster name plus the IP address of the worker node, for example:

0927-152944-dorky406-10-20-136-4

10-20-136-4 in this case corresponds to the IP address 10.20.136.4.
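For example, returning the hostname in a struct makes it easy to see afterwards how the rows were distributed. A minimal sketch (the schema, column names and dummy computation are just placeholders):

import socket
import time

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# hypothetical schema: the actual result plus the worker hostname and runtime
result_schema = StructType([
    StructField("result", DoubleType()),
    StructField("host", StringType()),
    StructField("duration_s", DoubleType()),
])

@udf(returnType=result_schema)
def my_udf_func(x):
    start = time.time()
    result = float(x) ** 2  # stand-in for the compute-heavy work
    return {
        "result": result,
        "host": socket.gethostname(),       # worker hostname
        "duration_s": time.time() - start,  # per-row runtime
    }

df = spark.range(100).withColumn("out", my_udf_func(col("id")))
df.groupBy("out.host").count().show()  # rows processed per worker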

socket.getsockname() seems to be inconsistent - I would not recommend using it.

Thomas