16

What port should I use to access the Spark UI on Google Dataproc?

I tried ports 4040 and 7077, as well as a number of other ports I found using netstat -pln.

The firewall is properly configured.

BAR

4 Answers

26

Dataproc runs Spark on top of YARN, so you won't find the typical "Spark standalone" ports. Instead, while a Spark job is running, visit port 8088, which serves the YARN ResourceManager's main page. Any running Spark job is reachable through the ApplicationMaster link on that page. The page behind that link looks the same as the familiar Spark application UI that you would normally find on port 4040 in default Spark setups.

Since workers check in over the internal network, YARN's links use cluster-internal hostnames (the hostnames should include your Dataproc cluster name as a prefix). This means that if you're accessing the page from an outside network, the links may not work at first; with the firewall-based approach, you have to replace the hostname in each link with the node's external IP address.
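As a rough illustration of that hostname swap, here is a minimal Python sketch. The cluster name my-cluster, the application ID, and the IP address 203.0.113.10 are made-up placeholders, and the helper use_external_ip is hypothetical; it only rewrites the URL string and does not contact the cluster.

```python
from urllib.parse import urlsplit, urlunsplit


def use_external_ip(url: str, external_ip: str) -> str:
    """Return url with its hostname replaced by external_ip.

    The scheme, port, path, query, and fragment are kept unchanged.
    """
    parts = urlsplit(url)
    netloc = external_ip if parts.port is None else f"{external_ip}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Hypothetical link as it might appear on the YARN page of a cluster
# named "my-cluster"; the IP is a documentation-only placeholder.
internal_link = "http://my-cluster-m:8088/proxy/application_1_0001/"
print(use_external_ip(internal_link, "203.0.113.10"))
```

Every link you follow would need the same treatment, which is why the proxy approach tends to be less tedious.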

An easier approach is to use the SOCKS proxy as explained here: https://cloud.google.com/dataproc/cluster-web-interfaces

In that case, simply use gcloud compute ssh to run a lightweight local SOCKS proxy, then open a browser configured to use that proxy; you will be able to click all the YARN links as normal.
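For reference, the two commands involved can be sketched as follows. This Python snippet only builds and prints the command lines; it does not run them. The master hostname my-cluster-m, the zone us-central1-a, the local port 1080, and the browser executable name google-chrome are all assumptions to adapt to your own setup, and the helper socks_proxy_commands is hypothetical.

```python
import shlex


def socks_proxy_commands(master: str, zone: str, port: int = 1080):
    """Build (but do not run) the SSH tunnel and browser command lines."""
    # Everything after "--" is passed to ssh itself:
    # -D opens a local SOCKS proxy on the given port, -N runs no remote command.
    tunnel = ["gcloud", "compute", "ssh", master, f"--zone={zone}",
              "--", "-D", str(port), "-N"]
    # Assumption: a Chrome binary named "google-chrome" is on the PATH.
    # A separate user-data-dir keeps the proxied session apart from normal browsing.
    browser = ["google-chrome",
               f"--proxy-server=socks5://localhost:{port}",
               f"--user-data-dir=/tmp/{master}",
               f"http://{master}:8088"]
    return tunnel, browser


tunnel_cmd, browser_cmd = socks_proxy_commands("my-cluster-m", "us-central1-a")
print(shlex.join(tunnel_cmd))
print(shlex.join(browser_cmd))
```

Run the first command in one terminal and leave it open, then run the second in another. Because the browser resolves hostnames through the tunnel, the cluster-internal hostnames in the YARN links should work unchanged.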

Dennis Huo
3

When following the instructions in Dennis's answer, I found that I could not connect to ports 8080 or 8088 on Dataproc image v1.0.

The list of open ports on the master node suggested trying 18080 (the Spark History Server's default port), which I did, following the documentation for that port, and voilà: access to the web UI.
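If you want to check from your own machine which of the candidate ports actually accept connections, a small probe along these lines may help. It is only a sketch: the helper open_ports is hypothetical, and a successful TCP connection tells you the port is reachable, not which service is behind it.

```python
import socket


def open_ports(host: str, ports, timeout: float = 2.0):
    """Return the subset of ports on host that accept a TCP connection."""
    reachable = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append(port)
        except OSError:
            pass
    return reachable
```

For example, open_ports("<master-external-ip>", [4040, 8080, 8088, 18080]) would list whichever of those ports the firewall and the node let through; the placeholder stands for your master node's address.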

Shog9
Frank
0

Since the nodes in my Dataproc cluster had public addresses, I created a firewall rule in Cloud Console allowing traffic from my corporate subnet to the Dataproc instances on ports 8088 (YARN ResourceManager) and 8042 (NodeManager web app address).

gogasca
0

For accessing the Spark UI you can follow these steps:

  1. First, don't forget to select the Enable component gateway option in the Components section of the Set up cluster tab while creating the Dataproc cluster.
  2. After the cluster is running, go to the Dataproc page in the GCP Console: Dataproc -> Clusters
  3. Find the cluster in the list and click on its link under the Name column to open the Cluster Details page.
  4. In the Cluster Details page click on the Web Interfaces tab.
  5. In the Web Interfaces tab click on the YARN ResourceManager link.
  6. The browser will open a new page for the YARN ResourceManager showing all available applications.
  7. Find the application with the type SPARK and the state RUNNING in the list/table on the YARN ResourceManager page.
  8. Scroll right and click on the link ApplicationMaster in the Tracking UI column.
  9. After this, you'll be forwarded to the Spark UI page.
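Steps 7 and 8 above can also be approximated in code. The YARN ResourceManager exposes a REST endpoint at /ws/v1/cluster/apps whose JSON response lists applications with fields such as id, applicationType, state, and trackingUrl. The sketch below only parses such a response; the sample JSON is made up, the helper running_spark_tracking_urls is hypothetical, and whether the endpoint is reachable in your setup (directly on port 8088, through a proxy, or otherwise) is an assumption this example does not cover.

```python
import json


def running_spark_tracking_urls(apps_json: str) -> dict:
    """Map application id -> trackingUrl for RUNNING applications of type SPARK."""
    payload = json.loads(apps_json)
    # The ResourceManager returns {"apps": null} when there are no applications.
    apps = (payload.get("apps") or {}).get("app", [])
    return {
        app["id"]: app.get("trackingUrl", "")
        for app in apps
        if app.get("applicationType") == "SPARK" and app.get("state") == "RUNNING"
    }


# Made-up sample shaped like a /ws/v1/cluster/apps response.
sample = """{"apps": {"app": [
  {"id": "application_1_0001", "applicationType": "SPARK", "state": "RUNNING",
   "trackingUrl": "http://my-cluster-m:8088/proxy/application_1_0001/"},
  {"id": "application_1_0002", "applicationType": "MAPREDUCE", "state": "FINISHED",
   "trackingUrl": ""}
]}}"""
print(running_spark_tracking_urls(sample))
```

The trackingUrl value corresponds to the ApplicationMaster link in the Tracking UI column.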
Juarez Rudsatz