After running PolynomialExpansion on a PySpark DataFrame, I have a DataFrame (polyDF) that looks like this:
+--------------------+--------------------+
| features| polyFeatures|
+--------------------+--------------------+
|(81,[2,9,26],[13....|(3402,[5,8,54,57,...|
|(81,[4,16,20,27,3...|(3402,[14,19,152,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
|(81,[4,27],[1.0,1...|(3402,[14,19,405,...|
The "features" column includes the features included in the original data. Each row represents a different user. There are 81 total possible features for each user in the original data. The "polyFeatures" column includes the features after the polynomial expansion has been run. There are 3402 possible polyFeatures after running PolynomialExpansion. So what each row of both columns contain are:
1. An integer representing the number of possible features (each user may or may not have had a value in each of the features).
2. A list of integers containing the feature indexes for which that user had a value.
3. A list of numbers containing the values of each of the features mentioned in #2 above.
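For example, the first row's "features" entry, (81,[2,9,26],[13....], breaks down into those three pieces like this (the values after 13.0 are placeholders, since the output above is truncated):

```python
# The three pieces stored in each SparseVector cell
# (all values after the first are placeholders -- the console output is truncated):
size = 81                   # 1. total number of possible features
indices = [2, 9, 26]        # 2. feature positions where this user has a value
values = [13.0, 1.0, 1.0]   # 3. the value at each of those positions

# Reconstructing the equivalent dense row:
dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v

print(dense[2])   # 13.0
print(dense[0])   # 0.0
```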
My question is, how can I take these two columns, create two sparse matrices, and subsequently join them together to get one full sparse PySpark matrix? Ideally it would look like this:
+---+----+----+----+------+----+----+----+----+---+---
| 1 | 2 | 3 | 4 | ... |405 |406 |407 |408 |409|...
+---+----+----+----+------+----+----+----+----+---+---
| 0 | 13 | 0 | 0 | ... | 0 | 0 | 0 | 6 | 0 |...
| 0 | 0 | 0 | 9 | ... | 0 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
| 0 | 0 | 0 | 1.0| ... | 3 | 0 | 0 | 0 | 0 |...
I have reviewed the Spark documentation for PolynomialExpansion located here, but it doesn't cover this particular issue. I have also tried to apply the SparseVector class, which is documented here, but it appears to apply to only a single vector rather than a DataFrame of vectors.
Is there an effective way to accomplish this?