I am working on setting up a Spark cluster in a multihomed network situation and have run into some problems. I'll start with the physical configuration.
I have 12 nodes, all in a single rack, that are connected to each other by a 100G InfiniBand network (using IPoIB) and also by a 1G management network.
Spark works great when I run jobs from the master node on the cluster. The trouble started when I tried to run jobs from my workstation, which is connected to the management network.
All of the Spark nodes have hosts files that resolve the cluster hostnames to their InfiniBand addresses, since I want the nodes to communicate over that network. I had to set SPARK_MASTER_HOST on the master node to 0.0.0.0 just to be able to connect to the master from my workstation at all.
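For reference, this is roughly how I create the session from my workstation (the IP is a placeholder for the master's management-network address):

    from pyspark.sql import SparkSession

    # Placeholder: the master's IP on the 1G management network
    MASTER_MGMT_IP = "10.0.0.10"

    spark = (
        SparkSession.builder
        .appName("workstation-test")
        .master(f"spark://{MASTER_MGMT_IP}:7077")
        .getOrCreate()
    )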
Now I can create a SparkSession and run operations, but everything hangs, and when I look at the worker logs I see "No route to host" errors. It seems that even though the default route on each node points at the management subnet, the workers are trying to connect back to the client over the InfiniBand network. (I should point out that I can ping my workstation from all of the nodes, so the routing itself is fine, and all firewalls are disabled at the moment.)
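In case it helps with diagnosis, I believe the address the driver advertises to the executors can be checked from the session above (just a sanity check, not a fix):

    # Inspect what address the driver tells the cluster to connect back to.
    # "spark" is the SparkSession created above.
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.driver.host"))                    # address executors will try to reach
    print(conf.get("spark.driver.bindAddress", "not set"))  # address the driver binds to locally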
As a side note, because of this setup the Spark master web UI doesn't work very well: all of the links to the workers point to their InfiniBand IP addresses, so they fail from my workstation, although if I manually change the IP in the address bar to the corresponding management-subnet address the page loads fine. This would be nice to fix as well, but it's not that big of a deal.
I have looked through the Spark documentation but didn't find anything that seemed to address this, and I have tried playing around with some of the network settings without much luck. I have a hard time believing that Spark doesn't support running on a private network, but maybe that is the case.
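To give an idea of what I mean by playing with the network settings, I have been trying driver-side options along these lines (both IPs are placeholders for management-network addresses), without success so far:

    from pyspark.sql import SparkSession

    # Same connection as above, plus hints about which network the driver lives on.
    spark = (
        SparkSession.builder
        .appName("workstation-test")
        .master("spark://10.0.0.10:7077")               # master's management-network IP (placeholder)
        .config("spark.driver.host", "10.0.0.100")      # my workstation's management-network IP (placeholder)
        .config("spark.driver.bindAddress", "0.0.0.0")  # interface the driver binds to locally
        .getOrCreate()
    )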
I appreciate any help or ideas you guys can give me.