
I have a PySpark dataframe:

ids names
[1, 1, 2, 3, 1, 2, 3, 7, 5] [a, b, c, l, s, o, c, d, e]
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] [s, l, h, p, q, g, c, d, p, s]
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] [q, a, z, w, s, e, r, t, y, o, p, a, x]
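
For reference, a minimal sketch to reproduce this input (column names and values taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ([1, 1, 2, 3, 1, 2, 3, 7, 5], ['a', 'b', 'c', 'l', 's', 'o', 'c', 'd', 'e']),
        ([3, 8, 9, 3, 9, 0, 0, 6, 7, 8], ['s', 'l', 'h', 'p', 'q', 'g', 'c', 'd', 'p', 's']),
        ([9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7], ['q', 'a', 'z', 'w', 's', 'e', 'r', 't', 'y', 'o', 'p', 'a', 'x']),
    ],
    ['ids', 'names'],
)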

I have two columns holding arrays of the same length. I want to split the first column's array (ids) on the value 7 (inclusive), e.g. [1,2,3,7,4,6,7] => [[1,2,3,7],[4,6,7]].

If there is only one 7, the split produces a single array, e.g. [1,2,3,4,7,8,0,5] => [[1,2,3,4,7]]; anything after the last 7 is of no use and is dropped.

The same split should be applied to the other column (names) at the same indices, producing segments of the same lengths, since each id is paired with a name. In other words, the names column must be split exactly where the ids column is.

Output should be:

ids names ids_splited names_splited
[1, 1, 2, 3, 1, 2, 3, 7, 5] [a, b, c, l, s, o, c, d, e] [[1, 1, 2, 3, 1, 2, 3, 7]] [[a, b, c, l, s, o, c, d]]
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] [s, l, h, p, q, g, c, d, p, s] [[3, 8, 9, 3, 9, 0, 0, 6, 7]] [[s, l, h, p, q, g, c, d, p]]
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] [q, a, z, w, s, e, r, t, y, o, p, a, x] [[9, 6, 5, 4, 7], [6, 5, 9, 2, 5, 5, 4, 7]] [[q, a, z, w, s], [e, r, t, y, o, p, a, x]]

I have tried many options but have not been able to resolve this.

Thanks in advance.

1 Answer


Quick solution with rdd + map

def split(r):
    A, B = [], []  # completed segments of ids and names
    a, b = [], []  # current, still-open segment

    # r is a Row of (ids, names); zip(*r) pairs each id with its name
    for x, y in zip(*r):
        if x != 7:
            a.append(x)
            b.append(y)
        else:
            # close the current segment, keeping the 7 (inclusive split)
            A.append([*a, x])
            B.append([*b, y])

            a, b = [], []  # reset for the next segment

    # leftovers after the last 7 are dropped; return original columns + splits
    return [*r, A, B]

result = df.rdd.map(split).toDF(['ids', 'names', 'ids_splited', 'names_splited'])

Result

+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
|ids                                    |names                                  |ids_splited                                |names_splited                              |
+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
|[1, 1, 2, 3, 1, 2, 3, 7, 5]            |[a, b, c, l, s, o, c, d, e]            |[[1, 1, 2, 3, 1, 2, 3, 7]]                 |[[a, b, c, l, s, o, c, d]]                 |
|[3, 8, 9, 3, 9, 0, 0, 6, 7, 8]         |[s, l, h, p, q, g, c, d, p, s]         |[[3, 8, 9, 3, 9, 0, 0, 6, 7]]              |[[s, l, h, p, q, g, c, d, p]]              |
|[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7]|[q, a, z, w, s, e, r, t, y, o, p, a, x]|[[9, 6, 5, 4, 7], [6, 5, 9, 2, 5, 5, 4, 7]]|[[q, a, z, w, s], [e, r, t, y, o, p, a, x]]|
+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
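
Note that `toDF` infers the result schema by sampling rows, which can fail if, say, `ids_splited` is empty on every sampled row. A hedged alternative, assuming the types shown in the example data, is to pass an explicit schema:

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField('ids', ArrayType(IntegerType())),
    StructField('names', ArrayType(StringType())),
    StructField('ids_splited', ArrayType(ArrayType(IntegerType()))),
    StructField('names_splited', ArrayType(ArrayType(StringType()))),
])

# same split function as above, but with schema inference bypassed
result = spark.createDataFrame(df.rdd.map(split), schema)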
Shubham Sharma
  • That's awesome, a clear and crisp solution, thanks @Shubham. One thing: I don't have an index column, and I am applying your split function to only two of the columns, not all of them, so later I have no index column to join on. Can I apply the split function to just those columns while leaving the rest of the dataframe unaffected? – praveen kumar Mar 30 '23 at 11:30
  • Got it. I passed all the columns through, and while iterating in the for loop I kept every column but used only the required ones. It's working fine! – praveen kumar Mar 30 '23 at 11:36
  • Great! Yes, just use the two columns in zip, e.g. `zip(r['ids'], r['names'])` (see the sketch after this comment thread). – Shubham Sharma Mar 30 '23 at 11:38
  • [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), do you have any idea on this: https://stackoverflow.com/questions/76258564/pyspark-how-to-attach-the-new-columns-from-other-pyspark-dataframe-based-on-mul? – praveen kumar May 16 '23 at 04:57
  • @praveenkumar I have added the answer, please check. – Shubham Sharma May 16 '23 at 18:39
  • Yes [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), I am checking now. By the way, thanks a lot, brother; I'm learning from you. – praveen kumar May 19 '23 at 10:54
  • [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), I have a huge dataset, and working with the rdd API is slow compared to the DataFrame API; can you comment on this? – praveen kumar Jul 19 '23 at 09:09
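
Following up on the comment about using two columns in `zip`: below is a sketch of that variant, which splits only `ids` and `names` and passes every other column of `df` through unchanged (column names are the ones from the question):

def split(r):
    A, B = [], []
    a, b = [], []

    # zip only the two columns to split; all other columns in r pass through
    for x, y in zip(r['ids'], r['names']):
        if x != 7:
            a.append(x)
            b.append(y)
        else:
            A.append([*a, x])
            B.append([*b, y])

            a, b = [], []  # reset for the next segment
    return [*r, A, B]

result = df.rdd.map(split).toDF([*df.columns, 'ids_splited', 'names_splited'])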