
After running PolynomialExpansion on a PySpark DataFrame, I have a DataFrame (polyDF) that looks like this:

+--------------------+--------------------+
|            features|        polyFeatures|
+--------------------+--------------------+
|(81,[2,9,26],[13....|(3402,[5,8,54,57,...|
|(81,[4,16,20,27,3...|(3402,[14,19,152,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|

The "features" column includes the features included in the original data. Each row represents a different user. There are 81 total possible features for each user in the original data. The "polyFeatures" column includes the features after the polynomial expansion has been run. There are 3402 possible polyFeatures after running PolynomialExpansion. So what each row of both columns contain are:

  1. An integer giving the total number of possible features (each user may or may not have a value for any given feature).
  2. A list of integers containing the feature indices at which that user has a value.
  3. A list of numbers containing the values at each of the indices in #2.
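
For concreteness, each cell in these columns is a `pyspark.ml.linalg.SparseVector` whose three parts map directly onto the list above. A minimal sketch (the values here are made up, since the real output above is truncated):

```python
from pyspark.ml.linalg import SparseVector

# SparseVector(size, indices, values) -- the three pieces listed above
v = SparseVector(81, [2, 9, 26], [13.0, 1.0, 2.0])

print(v.size)     # 81 -> item 1: total number of possible features
print(v.indices)  # [ 2  9 26] -> item 2: indices where this user has a value
print(v.values)   # [13.  1.  2.] -> item 3: the values at those indices
print(v[4])       # 0.0 -> indices not listed read as zero
```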

My question is: how can I take these two columns, create two sparse matrices, and then join them together to get one full, sparse PySpark matrix? Ideally it would look like this:

+---+----+----+----+------+----+----+----+----+---+---
| 1 | 2  | 3  | 4  |  ... |405 |406 |407 |408 |409|...
+---+----+----+----+------+----+----+----+----+---+---
| 0 | 13 | 0  | 0  | ...  | 0  | 0  | 0  | 6  | 0 |...
| 0 | 0  | 0  | 9  | ...  | 0  | 0  | 0  | 0  | 0 |...
| 0 | 0  | 0  | 1.0| ...  | 3  | 0  | 0  | 0  | 0 |...
| 0 | 0  | 0  | 1.0| ...  | 3  | 0  | 0  | 0  | 0 |...
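
At toy scale I can produce dense rows like this by mapping over the underlying RDD (a sketch, assuming polyDF as above; clearly not viable for real data, since every row expands to 3402 values):

```python
# Densify each polyFeatures vector -- only sensible for a small sample
dense_rows = polyDF.rdd.map(lambda row: row.polyFeatures.toArray())
print(dense_rows.take(2))
```

What I need is an efficient, distributed equivalent that keeps the data sparse.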

I have reviewed the Spark documentation for PolynomialExpansion located here, but it doesn't cover this particular issue. I have also tried the SparseVector class, which is documented here, but it appears to handle only a single vector rather than a DataFrame of vectors.

Is there an effective way to accomplish this?

  • Thousands of columns won't work well. Rather take a look at the `VectorAssembler`: http://stackoverflow.com/q/33273712/1560062 – zero323 Apr 19 '17 at 19:00
  • @zero323, thanks for the link. I looked through the Scala example it provides, and then went through the Pyspark example it links to (http://stackoverflow.com/questions/32982425/encode-and-assemble-multiple-features-in-pyspark), but both of those examples seem to be doing the opposite of what I need. I ran the Vector Assembler previously, which has brought me to the point I'm currently at. Now I need to take those vectors, and convert them to sparse matrices. Is there a way to do that, perhaps some sort of VectorDisAssembler? – Naim Apr 19 '17 at 20:47
  • Once you assemble you can convert to `RowMatrix` or `IndexedRowMatrix` (a sketch follows these comments). – zero323 Apr 19 '17 at 22:10
  • I guess your question is well answered [here](https://stackoverflow.com/questions/38384347/how-to-split-vector-into-columns-using-pyspark). – Manrique Nov 19 '18 at 15:58
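
For reference, a minimal sketch of the conversion zero323 suggests, assuming polyDF as above and Spark 2.x (the `ml` vectors must first be converted to their `mllib` counterparts via `Vectors.fromML`):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix, RowMatrix

# RowMatrix: a distributed matrix whose rows carry no meaningful index
row_mat = RowMatrix(polyDF.rdd.map(lambda row: Vectors.fromML(row.polyFeatures)))
print(row_mat.numRows(), row_mat.numCols())  # e.g. 4 3402

# IndexedRowMatrix: the same, but each row keeps an explicit long index
indexed_mat = IndexedRowMatrix(
    polyDF.rdd.zipWithIndex().map(
        lambda pair: IndexedRow(pair[1], Vectors.fromML(pair[0].polyFeatures))
    )
)
```

Both representations keep the data sparse and distributed, which avoids the thousands-of-columns problem zero323 warns about.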
