I have a sparse matrix (which I receive from a Python function) that I want to convert to a NumPy matrix. The NumPy matrix will not fit in local RAM, so I want to convert it into an RDD in PySpark. I am not that familiar with Spark in general, so I do not know how to load a local sparse matrix into an RDD.
-
I don't think there is an easy way; there's no `sparse matrix` in `pyspark`. You could write the matrix to a file, see [this SO question](http://stackoverflow.com/questions/8955448/save-load-scipy-sparse-csr-matrix-in-portable-data-format) and [this numpy cookbook](https://redmine.epfl.ch/projects/python_cookbook/wiki/Sparse_matrix_format_conversion). Then you could read it into a DataFrame with something like [this](https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema), converting from the `sparse matrix` format along the way. Hope it helps. – lrnzcig Jun 17 '15 at 16:55
-
I just saw the new version of pyspark (1.4.1), and they added sparse matrices. But I had already found a workaround. I didn't want to save to a file and then read it again, so I converted to a COO sparse matrix, which gives a simple local representation of the sparse matrix (coordinates with values), and parallelized the coordinates together with the data (see the sketch after these comments). This way I can map and aggregate the data by row, which works at least partially for what I need. In any case, the sparse matrix support should work as well. – user1676389 Jun 18 '15 at 10:39
-
Thanks for the update! – lrnzcig Jun 18 '15 at 13:44
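
A minimal sketch of the COO workaround described in the comments above. The names `sc` (an existing SparkContext) and `local_mat` (a stand-in for the scipy sparse matrix returned by the Python function) are assumptions, not from the thread:

```python
import scipy.sparse as sps
from pyspark import SparkContext

sc = SparkContext("local", "sparse-demo")

# Toy stand-in for the sparse matrix the question receives from a function.
local_mat = sps.random(4, 5, density=0.3, format="csr", random_state=0)

# COO format exposes parallel row/col/data arrays, i.e. (row, col, value)
# triples, which can be handed straight to parallelize().
coo = local_mat.tocoo()
entries = sc.parallelize(list(zip(coo.row, coo.col, coo.data)))

# Group the values by row, as the comment describes ("aggregate by row").
by_row = (entries
          .map(lambda rcv: (int(rcv[0]), [(int(rcv[1]), float(rcv[2]))]))
          .reduceByKey(lambda a, b: a + b))
print(by_row.collect())
```

Keying the RDD by row index like this keeps each row's non-zero entries together, which is what makes the per-row map/aggregate step in the comment possible.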
1 Answer
This question was asked with "pre-1.4.1 Spark knowledge". Sparse matrices have since been added to the Spark library: Spark `SparseMatrix`.
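
For reference, a minimal sketch of that type. Note that `SparseMatrix` is a *local*, in-memory matrix in CSC layout, not a distributed one; the values below are made up for illustration:

```python
from pyspark.mllib.linalg import Matrices

# 3x2 matrix in CSC layout. colPtrs, rowIndices, and values follow the
# same convention as scipy's csc_matrix (indptr, indices, data), so a
# local scipy matrix's arrays map across directly.
m = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9.0, 6.0, 8.0])
print(m.toArray())
# [[ 9.  0.]
#  [ 0.  8.]
#  [ 0.  6.]]
```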

user1676389