
When a new PySpark application is started, it creates a web UI with tabs for Jobs, Stages, Executors, etc. The Executors tab shows the full list of executors and some information about each one, such as the number of cores and storage memory used vs. total.

My question is whether I can somehow access the same information (or at least part of it) programmatically from the application itself, e.g. with something like spark.sparkContext.<function_name_to_get_info_about_executors>()?

I've found a workaround that issues a URL request similar to what the web UI does, but I suspect I'm missing a simpler solution.

I'm using Spark 3.0.0.

Alexander Pivovarov

2 Answers


The only way I found so far seems hacky to me: it scrapes the same URL that the web UI queries, i.e. doing this:

import urllib.request
import json

sc = spark.sparkContext
# Same REST endpoint the web UI queries for the Executors tab
u = sc.uiWebUrl + '/api/v1/applications/' + sc.applicationId + '/allexecutors'
with urllib.request.urlopen(u) as response:
    executors_data = json.loads(response.read().decode())
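
For reference, executors_data is then a plain Python list with one dict per executor. The field names below (id, hostPort, totalCores, memoryUsed, maxMemory) come from the ExecutorSummary object in Spark's REST API, so check the docs for your version:

# each entry describes one executor (the driver appears with id 'driver')
for e in executors_data:
    print(e['id'], e['hostPort'], e['totalCores'], e['memoryUsed'], e['maxMemory'])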
Alexander Pivovarov

Another option is to implement a SparkListener that overrides some or all of the onExecutor...() methods, depending on your needs, and then add it at spark-submit time with --conf spark.extraListeners=<your listener class>.

Your own solution is totally legit too; it just uses Spark's REST API.

Both are going to be quite involved, so pick your poison -- parse long JSON or go through a hierarchy of Developer API objects.
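
For what it's worth, here is a rough, untested sketch of the listener route done from Python via py4j, rather than as a JVM class passed to spark.extraListeners (see the comments below for background). ExecutorTracker is a hypothetical name, and as the last comment notes, executors added before the listener is registered won't be observed:

from pyspark.java_gateway import ensure_callback_server_started

class ExecutorTracker:
    def __init__(self):
        self.executors = set()

    def onExecutorAdded(self, event):
        self.executors.add(event.executorId())

    def onExecutorRemoved(self, event):
        self.executors.discard(event.executorId())

    def __getattr__(self, name):
        # SparkListenerInterface declares many more onXxx methods;
        # answer them all with a no-op so py4j callbacks don't fail
        return lambda *args, **kwargs: None

    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]

sc = spark.sparkContext
ensure_callback_server_started(sc._gateway)  # needed for JVM -> Python callbacks
listener = ExecutorTracker()
sc._jsc.sc().addSparkListener(listener)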

mazaneicha
  • I'm not sure I understand how to use a custom listener like that here. Imagine I go through the trouble and write a new class like that in Java. Maybe that class could keep track of existing executors, and I could enable it using `spark.extraListeners`. Then how would that help me get this information in Python? Is there an easy way to access these extra listeners via py4j or something like that? If so, maybe I can access some other existing Java objects instead and skip the part where I need to implement a new `SparkListener`? I imagine executor information is already available somewhere in Java. – Alexander Pivovarov Jun 23 '20 at 06:28
  • You may have seen this similar SO question: https://stackoverflow.com/questions/44082957/how-to-add-a-sparklistener-from-pyspark-in-python. I wouldn't call it easy, but... – mazaneicha Jun 23 '20 at 10:20
  • Interesting. That explains a lot, but I still don't see how we can make executor information accessible to the main Python program: if the listener is added after the SparkSession was created, then executors were likely already added by that point (when we register an extra listener). On the other hand, if the listener was added "automatically", e.g. with `--conf spark.extraListeners`, then the Python program doesn't hold the listener object directly (as opposed to the case where we do `listener = MyListener(); sc._jsc.sc().addSparkListener(listener)`). – Alexander Pivovarov Jun 23 '20 at 15:06
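
Following up on the last comment ("executor information is already available somewhere in Java"): one option that avoids writing a listener altogether is to call the JVM-side SparkStatusTracker through py4j. This is a sketch that relies on PySpark's private `_jsc` handle, so treat it as unsupported and version-dependent:

sc = spark.sparkContext
# SparkStatusTracker.getExecutorInfos() exists in the Scala/Java API
# but is not exposed through PySpark's own StatusTracker wrapper
infos = sc._jsc.sc().statusTracker().getExecutorInfos()
for info in infos:
    # storage-memory accessors assume a recent (2.2+) SparkExecutorInfo
    print(info.host(), info.port(), info.numRunningTasks(),
          info.usedOnHeapStorageMemory(), info.totalOnHeapStorageMemory())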