Is there an inverse function for pyspark.sql.functions.explode? Rather than exploding an array into separate rows of an Apache Spark DataFrame, I need to do the reverse: assemble arrays from non-zero elements stored one per row in a DataFrame.
- Input: DataFrame with columns (key1, key2, array_index, array_value)
- Output: DataFrame with columns (key1, key2, array), with one row per distinct (key1, key2) pair (see the toy example below).
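To make the goal concrete, here is a toy input and the output I'm hoping for (values made up; only the non-zero elements of each array are stored, one (index, value) pair per row):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy input: sparse arrays stored as one (index, value) pair per row.
df = spark.createDataFrame(
    [("a", "x", 0, 1.0),
     ("a", "x", 2, 3.0),
     ("b", "y", 1, 2.0)],
    ["key1", "key2", "array_index", "array_value"],
)

# Desired output, one row per (key1, key2) pair, with missing
# indices filled in as zero:
# +----+----+---------------+
# |key1|key2|array          |
# +----+----+---------------+
# |a   |x   |[1.0, 0.0, 3.0]|
# |b   |y   |[0.0, 2.0]     |
# +----+----+---------------+
```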
I want to make sure this mapping is carried out in a distributed fashion on the worker nodes, not serially on the driver node. The general approach suggested at https://blogs.msdn.microsoft.com/azuredatalake/2016/02/10/pyspark-appending-columns-to-dataframe-when-dataframe-withcolumn-cannot-be-used/ looks promising, but I'm not sure how best to apply it to this array-construction problem.
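For reference, here is a minimal, untested RDD-based sketch of what I have in mind. `build_array` is a hypothetical helper I wrote to densify the sparse (index, value) pairs, assuming each array's length is the maximum index seen plus one; I'd still prefer a DataFrame-native equivalent if one exists:

```python
from pyspark.sql import Row

def build_array(pairs):
    # Expand sparse (index, value) pairs into a dense list,
    # filling the missing positions with zero.
    pairs = sorted(pairs)            # sort by array_index
    size = pairs[-1][0] + 1          # length = max index + 1
    arr = [0.0] * size
    for i, v in pairs:
        arr[i] = v
    return arr

rows = (df.rdd
          .map(lambda r: ((r.key1, r.key2), (r.array_index, r.array_value)))
          .groupByKey()              # shuffle/grouping happens on the workers
          .map(lambda kv: Row(key1=kv[0][0], key2=kv[0][1],
                              array=build_array(list(kv[1])))))
result = rows.toDF()
```

My understanding is that the groupByKey and map stages run on the executors, so only the final collect (if any) would touch the driver, but I'd welcome corrections if that's wrong.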