I have a sparse matrix (which I receive from a Python function) that I want to convert to a NumPy matrix. The NumPy matrix will not fit in local RAM, so I want to convert it into an RDD in PySpark. I am not that familiar with Spark in general, so I do not know how to load a local sparse matrix into an RDD.
-
I don't think there is an easy way; there's no `sparse matrix` in `pyspark`. You could write the matrix to a file, see [this SO question](http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format) and [this numpy cookbook](https://redmine.epfl.ch/projects/python_cookbook/wiki/Sparse_matrix_format_conversion). Then you could read it into a DataFrame with something like [this](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema), converting from the `sparse matrix` format along the way. Hope it helps. – lrnzcig Jun 17 '15 at 16:55
-
I just saw the new version of pyspark (1.4.1), and they added sparse matrices. But I had already found a workaround. I didn't want to save to a file and then read it again, so I converted to a COO sparse matrix, which gives a simple local representation of the sparse matrix (coordinates with values), and parallelized the coordinates together with the data (see the sketch after these comments). This way I can map and aggregate the data by row, which works at least partially for what I need. In any case, the sparse matrix support should work as well. – user1676389 Jun 18 '15 at 10:39
-
Thanks for the update! – lrnzcig Jun 18 '15 at 13:44
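
A minimal sketch of the COO workaround described in the comments above. The names `sc` (an existing SparkContext) and `local_mat` (a stand-in for the scipy sparse matrix returned by the Python function) are assumptions, not from the thread:

```python
import scipy.sparse as sps
from pyspark import SparkContext

sc = SparkContext("local", "sparse-demo")

# Toy stand-in for the sparse matrix the question receives from a function.
local_mat = sps.random(4, 5, density=0.3, format="csr", random_state=0)

# COO format exposes parallel row/col/data arrays, i.e. (row, col, value)
# triples, which can be handed straight to parallelize().
coo = local_mat.tocoo()
entries = sc.parallelize(list(zip(coo.row, coo.col, coo.data)))

# Group the values by row, as the comment describes ("aggregate by row").
by_row = (entries
          .map(lambda rcv: (int(rcv[0]), [(int(rcv[1]), float(rcv[2]))]))
          .reduceByKey(lambda a, b: a + b))
print(by_row.collect())
```

Keying the RDD by row index like this keeps each row's non-zero entries together, which is what makes the per-row map/aggregate step in the comment possible.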
1 Answer
This question was asked with "pre-1.4.1 Spark knowledge". Sparse matrices have since been added to the Spark library: Spark `SparseMatrix`.
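
For reference, a minimal sketch of that type. Note that `SparseMatrix` is a *local*, in-memory matrix in CSC layout, not a distributed one; the values below are made up for illustration:

```python
from pyspark.mllib.linalg import Matrices

# 3x2 matrix in CSC layout. colPtrs, rowIndices, and values follow the
# same convention as scipy's csc_matrix (indptr, indices, data), so a
# local scipy matrix's arrays map across directly.
m = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0])
print(m.toArray())
# [[ 9.  0.]
#  [ 0.  8.]
#  [ 0.  6.]]
```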

user1676389