PySpark Block Matrix multiplication fails with OOM

Question

I'm trying to do a matrix multiplication chain of size 67584*67584 using Pyspark but it constantly runs out of memory or OOM error.Here are the details: Input is matlab file(.mat file) which has the matrix in a single file. I load the file using scipy loadmat, split the file into multiple files of block size (1024*1024) and store them back in .mat format. Now mapper loads each file using filelist and create a rdd of blocks.

filelist = sc.textFile(BLOCKS_DIR + 'filelist.txt',minPartitions=200)
blocks_rdd = filelist.map(MapperLoadBlocksFromMatFile).cache()

MapperLoadBlocksFromMatFile is a function as below:

 def MapperLoadBlocksFromMatFile(filename):
         data = loadmat(filename)
         G = data['G']
         id = data['block_id'].flatten()
         n = G.shape[0]
         if(not(isinstance(G,sparse.csc_matrix))):
                 sub_matrix = Matrices.dense(n, n, G.transpose().flatten())
         else:
                 sub_matrix = Matrices.dense(n,n,np.array(G.todense()).transpose().flatten())
     return ((id[0], id[1]), sub_matrix)

Now once i have this rdd, i create a BlockMatrix from it. and Do a matrix multiplication with it.

adjacency_mat = BlockMatrix(blocks_rdd, block_size, block_size, adj_mat.shape[0], adj_mat.shape[1])

I'm using the multiply method from BlockMatrix implementation and it runs out of memory every single time.

Result = adjacency_mat.multiply(adjacency_mat)

Below are the cluster configuration details:

50 nodes of 64gb Memory and 20 cores processors.
worker-> 60gb and 16 cores
executors-> 15gb and 4 cores each
driver.memory -> 60gb and maxResultSize->10gb

i even tried with rdd.compress. Inspite of having enough memory and cores, i run out of memory every time. Every time a different node runs out of memory and i don't have an option of using visualVM in the cluster . What am i doing wrong? Is the way blockmatrix is created wrong? Or am i not accounting for enough memory? OOM Error Stacktrace

Try to give less memory for the worker and executors (BTW what's the difference between them? :P ). Yes I said less... — gsamaras, Sep 08 '16 at 00:05
a worker can host multiple executors. So, in this case, worker which has 60gb and 16 core, can host upto 4 executors since each executor has 15gb and 4core for them. And about the less memory, i did try the less memory option.Didn't work either — Vinodh Paramesh, Sep 08 '16 at 00:12
Oh yeah correct! Damn I am too tired...If you see memoryOverhead issues, [here is how I did it](https://gsamaras.wordpress.com/code/memoryoverhead-issue-in-spark/), but I really can't help now, must rest. Good luck! — gsamaras, Sep 08 '16 at 00:16
Thanks for sharing the blog.Though I'm not using yarn, i ll give it a try reducing executor memory to account for python processes. Oh ya and u definitely should rest ! — Vinodh Paramesh, Sep 08 '16 at 00:45
There is [another post](http://stackoverflow.com/questions/32336915/pyspark-java-lang-outofmemoryerror-java-heap-space) which looks related to your question. — Axel Kemper, Sep 08 '16 at 19:45
@AxelKemper Thanks for pointing. Yup, the post is related but I was wondering if the solution to my problem is specific to something with BlockMatrix. I indeed try the settings that post suggested — Vinodh Paramesh, Sep 09 '16 at 22:08

PySpark Block Matrix multiplication fails with OOM

0 Answers0