I have created a cluster in GCP Dataproc with one master (clus-m) and two worker nodes (clus-w-0, clus-w-1). Using a PySpark RDD, I want to distribute a single task so that all the nodes get involved. Below is my code snippet.
import os
import numpy as np
from pyspark import SparkContext

def pair_dist(row):
    # row is a (line, line) pair from cartesian(); count positions where the two rows differ
    dissimilarity = 0
    Z = row[0].split(',')
    X = row[1].split(',')
    for j in range(len(Z)):
        if Z[j] != X[j]:
            dissimilarity += 1
    # append the hostname so I can see which node processed this pair
    return str(dissimilarity) + os.uname()[1]

sc = SparkContext.getOrCreate()
rdd = sc.textFile("input.csv")
rdd = sc.parallelize(rdd.take(5))  # keep only the first 5 lines
rdd = rdd.cartesian(rdd)           # all 5 x 5 pairs
dist = rdd.map(lambda x: pair_dist(x)).collect()
dist = np.array(dist).reshape((5, 5))
print(dist)
sc.stop()
To check whether the work was actually distributed, I append the hostname to each result. However, I always get the master's hostname (clus-m) in the result, never the worker nodes' hostnames.
Output (5x5): [0clus-m 2clus-m ... 1clus-m 0clus-m ...]
Please suggest what exactly I need to do so that the task is spread across all the nodes.
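For reference, a minimal sketch of how the placement could be checked (an assumption-laden sketch, not part of the original attempt: it assumes the job is submitted to YARN via spark-submit rather than run with a local master, and tag_with_host plus the numSlices value are purely illustrative):

import socket
from pyspark import SparkContext, TaskContext

sc = SparkContext.getOrCreate()
print(sc.master)  # if this prints something like local[*], all tasks run on clus-m by design

rdd = sc.textFile("input.csv")
rdd = sc.parallelize(rdd.take(5), numSlices=4)  # force more than one partition explicitly
pairs = rdd.cartesian(rdd)

def tag_with_host(partition):
    # runs once per partition, on whichever executor owns that partition
    host = socket.gethostname()
    pid = TaskContext.get().partitionId()
    for _ in partition:
        yield (pid, host)

print(sorted(set(pairs.mapPartitions(tag_with_host).collect())))
sc.stop()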
def hostname(row):
    return os.uname()[1]

data = pd.read_csv("input.csv", header=None)
data = np.array(data)
rdd = sc.parallelize(data)
rdd = rdd.repartition(3)
print(rdd.getNumPartitions())
rdd = rdd.cartesian(rdd)
print(rdd.count())
dist = rdd.map(lambda x: hostname(x)).collect()
dist = np.array(dist).reshape((120, 120))
print(dist)
– srimanta Mar 05 '20 at 12:44
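As a small companion check to the snippet in the comment above (again only a sketch, assuming rdd is the repartitioned base RDD before cartesian() and that sc is already created): glom() shows how many rows land in each partition, which helps tell whether the cartesian product is actually spread over several partitions or collapsed into one.

print(rdd.glom().map(len).collect())    # rows per partition of the base RDD
pairs = rdd.cartesian(rdd)
print(pairs.getNumPartitions())         # cartesian() multiplies the two partition counts
print(pairs.glom().map(len).collect())  # rows per partition of the pair RDD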