I'm trying to do a matrix multiplication chain of size 67584*67584 using Pyspark but it constantly runs out of memory or OOM error.Here are the details: Input is matlab file(.mat file) which has the matrix in a single file. I load the file using scipy loadmat, split the file into multiple files of block size (1024*1024) and store them back in .mat format. Now mapper loads each file using filelist and create a rdd of blocks.
filelist = sc.textFile(BLOCKS_DIR + 'filelist.txt',minPartitions=200)
blocks_rdd = filelist.map(MapperLoadBlocksFromMatFile).cache()
MapperLoadBlocksFromMatFile is a function as below:
def MapperLoadBlocksFromMatFile(filename):
data = loadmat(filename)
G = data['G']
id = data['block_id'].flatten()
n = G.shape[0]
if(not(isinstance(G,sparse.csc_matrix))):
sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
else:
sub_matrix = Matrices.dense(n,n,np.array(G.todense()).transpose().flatten())
return ((id[0], id[1]), sub_matrix)
Now once i have this rdd, i create a BlockMatrix from it. and Do a matrix multiplication with it.
adjacency_mat = BlockMatrix(blocks_rdd, block_size, block_size, adj_mat.shape[0], adj_mat.shape[1])
I'm using the multiply method from BlockMatrix implementation and it runs out of memory every single time.
Result = adjacency_mat.multiply(adjacency_mat)
Below are the cluster configuration details:
50 nodes of 64gb Memory and 20 cores processors.
worker-> 60gb and 16 cores
executors-> 15gb and 4 cores each
driver.memory -> 60gb and maxResultSize->10gb
i even tried with rdd.compress. Inspite of having enough memory and cores, i run out of memory every time. Every time a different node runs out of memory and i don't have an option of using visualVM in the cluster . What am i doing wrong? Is the way blockmatrix is created wrong? Or am i not accounting for enough memory? OOM Error Stacktrace