
I have a SparkSQL DataFrame and a 2D numpy matrix with the same number of rows. I want to add each row of the numpy matrix as a new column to the existing PySpark DataFrame, so that every row gets a different list.

For example, the PySpark dataframe is like this

| Id     | Name   |
| ------ | ------ |
| 1      | Bob    |
| 2      | Alice  |
| 3      | Mike   |

And the numpy matrix is like this

[[2, 3, 5]
 [5, 2, 6]
 [1, 4, 7]]

The resulting expected dataframe should be like this

| Id     | Name   | customized_list |
| ------ | ------ | --------------- |
| 1      | Bob    | [2, 3, 5]       |
| 2      | Alice  | [5, 2, 6]       |
| 3      | Mike   | [1, 4, 7]       |

The Id column corresponds to the row order in the numpy matrix.

Is there an efficient way to implement this?

XIN LIU

1 Answer


Create a DataFrame from your numpy matrix with an Id column indicating the row number. Then join it to your original PySpark DataFrame on the Id column.

import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#|  1|      [2, 3, 5]|
#|  2|      [5, 2, 6]|
#|  3|      [1, 4, 7]|
#+---+---------------+

Here I used enumerate(..., start=1) to add the row number.
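One detail worth noting: `.tolist()` does more than reshape the data — it converts numpy scalar types to plain Python ints, which Spark's schema inference can handle. A quick check (plain numpy, no Spark needed):

```python
import numpy as np

a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])

# Indexing the array yields numpy scalar types (e.g. numpy.int64),
# which Spark's type inference does not accept directly.
print(type(a[0, 0]))           # a numpy integer type

# tolist() converts every element to a plain Python int.
print(type(a.tolist()[0][0]))  # <class 'int'>
```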

Now just do an inner join:

df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#|  1|  Bob|      [2, 3, 5]|
#|  3| Mike|      [1, 4, 7]|
#|  2|Alice|      [5, 2, 6]|
#+---+-----+---------------+
pault
  • What's the solution if I do not have an identifier like `Id` (a sequence of increasing numbers starting from 1)? – XIN LIU Oct 04 '19 at 20:22
  • @XINLIU then you will have to add an `Id` column: [Pyspark add sequential and deterministic index to dataframe](https://stackoverflow.com/questions/52318016/pyspark-add-sequential-and-deterministic-index-to-dataframe). – pault Oct 04 '19 at 20:25