I have created a 7-node cluster on Dataproc (1 master and 6 executors: 3 primary executors and 3 secondary preemptible executors). I can see in the console that the cluster was created correctly, and I have all 6 IPs and VM names. I am trying to test the cluster, but the code does not seem to run on all the executors, only on 2 at most. This is the code I am using to check how many executors the code ran on:
import numpy as np
import socket
set(sc.parallelize(range(1,1000000)).map(lambda x : socket.gethostname()).collect())
output:
{'monsoon-testing-sw-543d', 'monsoon-testing-sw-p7w7'}
I have restarted the kernel many times. The specific executors change between runs, but the number of executors on which the code runs stays the same.
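For a more detailed picture than the set of hostnames, a check along the following lines could be run as well. This is only a minimal sketch: `describe_parallelism` is a hypothetical helper name, it assumes `sc` is the SparkContext already provided by the notebook, and the two configuration keys it reads are standard Spark settings that may or may not be set on this cluster.

```python
import socket
from collections import Counter


def describe_parallelism(sc, n=1_000_000, num_slices=None):
    """Summarise how a test RDD is partitioned and where its partitions ran.

    `sc` is assumed to be an existing SparkContext (as in the notebook above).
    """
    # num_slices=None lets Spark fall back to sc.defaultParallelism
    rdd = sc.parallelize(range(n), num_slices)
    conf = sc.getConf()

    # Emit one (hostname, row_count) pair per partition instead of one per row
    per_partition = rdd.mapPartitions(
        lambda rows: [(socket.gethostname(), sum(1 for _ in rows))]
    ).collect()

    rows_per_host = Counter()
    for host, count in per_partition:
        rows_per_host[host] += count

    return {
        "defaultParallelism": sc.defaultParallelism,
        "numPartitions": rdd.getNumPartitions(),
        "spark.executor.instances": conf.get("spark.executor.instances", "not set"),
        "spark.dynamicAllocation.enabled": conf.get(
            "spark.dynamicAllocation.enabled", "not set"
        ),
        "rowsPerHost": dict(rows_per_host),
    }


# Usage in the notebook (assumes `sc` exists):
# print(describe_parallelism(sc))
# print(describe_parallelism(sc, num_slices=48))  # force more partitions
```

Comparing the result with and without an explicit `num_slices` should show whether the limit comes from the number of partitions or from the number of executors Spark has actually been given.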
Can somebody help me understand what is going on here and why PySpark is not distributing my code across all the executors?