
I have run an ALS model, and am working with the output recommendations. They are currently in this format:

|user_id|item_ids|    ratings   |  
|   72  | [5,2,3]| [1.2,3.6,4.1]| 
|   16  | [1,9,5]| [4.4,2.0,3.9]|

Each row contains a list in the "item_ids" column and a list in the "ratings" column.

What I'd like to do is "unpivot" each list in "item_ids" and "ratings" so that it looks like this:

|user_id|item_ids|ratings|
|  72   |   5    |  1.2  |  
|  72   |   2    |  3.6  |  
|  72   |   3    |  4.1  |  
|  16   |   1    |  4.4  | 
|  16   |   9    |  2.0  | 
|  16   |   5    |  3.9  |  

The only way I have found to do this is one user at a time: filter for a single user_id, convert the lists to columns in a Pandas dataframe, and round-trip that back to a Spark dataframe. As you might imagine, that would take an eternity with a few million user_ids. I have also considered the "stack" function described here, but the manual inputs it requires would be just as slow, and the code found here, but it doesn't work when lists are involved.

Is there a more efficient way to do this without having to lean on Pandas? I'm running PySpark (Spark 2.2.0) on Databricks.

Naim
  • You could do something like `df.select("user_id", explode("item_ids").alias("item_ids"), monotonically_increasing_id().alias("id")).join(df.select(explode("ratings").alias("ratings"), monotonically_increasing_id().alias("id")), "id").show()` perhaps but there is probably a better way – David Arenburg Aug 23 '17 at 07:38
  • I'm not sure why any of those solutions in the links you have shared won't work – eliasah Aug 23 '17 at 08:00

0 Answers