Is there an inverse function for pyspark.sql.functions.explode? Rather than exploding an array into separate rows of an Apache Spark DataFrame, I need to do the reverse: assemble arrays from non-zero elements stored one per row in a DataFrame.
- Input: DataFrame with columns (key1, key2, array_index, array_value)
- Output: DataFrame with columns (key1, key2, array), with one row per distinct (key1, key2) pair (see the toy example below).
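To make the goal concrete, here is a toy input and the output I'm hoping for (values made up; only the non-zero elements of each array are stored, one (index, value) pair per row):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy input: sparse arrays stored as one (index, value) pair per row.
df = spark.createDataFrame(
    [("a", "x", 0, 1.0),
     ("a", "x", 2, 3.0),
     ("b", "y", 1, 2.0)],
    ["key1", "key2", "array_index", "array_value"],
)

# Desired output, one row per (key1, key2) pair, with missing
# indices filled in as zero:
# +----+----+---------------+
# |key1|key2|array          |
# +----+----+---------------+
# |a   |x   |[1.0, 0.0, 3.0]|
# |b   |y   |[0.0, 2.0]     |
# +----+----+---------------+
```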
I want to make sure this mapping is carried out in a distributed fashion on the worker nodes, not serially on the driver node. The general approach suggested at https://blogs.msdn.microsoft.com/azuredatalake/2016/02/10/pyspark-appending-columns-to-dataframe-when-dataframe-withcolumn-cannot-be-used/ looks promising, but I'm not sure how best to apply it to this array-construction problem.
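For reference, here is a minimal, untested RDD-based sketch of what I have in mind. `build_array` is a hypothetical helper I wrote to densify the sparse (index, value) pairs, assuming each array's length is the maximum index seen plus one; I'd still prefer a DataFrame-native equivalent if one exists:

```python
from pyspark.sql import Row

def build_array(pairs):
    # Expand sparse (index, value) pairs into a dense list,
    # filling the missing positions with zero.
    pairs = sorted(pairs)            # sort by array_index
    size = pairs[-1][0] + 1          # length = max index + 1
    arr = [0.0] * size
    for i, v in pairs:
        arr[i] = v
    return arr

rows = (df.rdd
          .map(lambda r: ((r.key1, r.key2), (r.array_index, r.array_value)))
          .groupByKey()              # shuffle/grouping happens on the workers
          .map(lambda kv: Row(key1=kv[0][0], key2=kv[0][1],
                              array=build_array(list(kv[1])))))
result = rows.toDF()
```

My understanding is that the groupByKey and map stages run on the executors, so only the final collect (if any) would touch the driver, but I'd welcome corrections if that's wrong.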