I have a PySpark DataFrame:
ids | names |
---|---|
[1, 1, 2, 3, 1, 2, 3, 7, 5] | [a, b, c, l, s, o, c, d, e] |
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] | [s, l, h, p, q, g, c, d, p, s] |
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] | [q, a, z, w, s, e, r, t, y, o, p, a, x] |
Both columns hold arrays of the same length. I want to split the array in the first column (ids) after every occurrence of the value 7, keeping the 7 at the end of its chunk, for example [1,2,3,7,4,6,7] => [[1,2,3,7],[4,6,7]].
If there is only one 7, the result contains a single array, for example [1,2,3,4,7,8,0,5] => [[1,2,3,4,7]].
Any elements after the last 7 are discarded.
The second column (names) should be split at the same index positions, so it yields sub-arrays of the same lengths as ids, because each id is paired with the name at the same position.
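To make the rule concrete, here is a minimal plain-Python sketch of the behaviour I am after. It is not a Spark solution, and the function name `split_on_marker` and its `marker` parameter are made up for illustration only.

```python
def split_on_marker(ids, names, marker=7):
    """Split both lists after every occurrence of `marker` in `ids`.

    Elements after the last marker are dropped. `names` is cut at the
    same index positions as `ids`. (Hypothetical helper, for illustration.)
    """
    ids_split, names_split = [], []
    start = 0  # index where the current chunk begins
    for i, value in enumerate(ids):
        if value == marker:
            # Close the current chunk, including the marker itself.
            ids_split.append(ids[start:i + 1])
            names_split.append(names[start:i + 1])
            start = i + 1
    return ids_split, names_split


ids = [9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7]
names = list("qazwsertyopax")
print(split_on_marker(ids, names))
```

Something like this could presumably be wrapped in a Python UDF, but I would prefer a solution that works on the array columns directly if one exists.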
Output should be:
ids | names | ids_splited | names_splited |
---|---|---|---|
[1, 1, 2, 3, 1, 2, 3, 7, 5] | [a, b, c, l, s, o, c, d, e] | [[1, 1, 2, 3, 1, 2, 3, 7]] | [[a, b, c, l, s, o, c, d]] |
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] | [s, l, h, p, q, g, c, d, p, s] | [[3, 8, 9, 3, 9, 0, 0, 6, 7]] | [[s, l, h, p, q, g, c, d, p]] |
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] | [q, a, z, w, s, e, r, t, y, o, p, a, x] | [[9, 6, 5, 4, 7], [6, 5, 9, 2, 5, 5, 4, 7]] | [[q, a, z, w, s], [e, r, t, y, o, p, a, x]] |
I have tried many approaches, but I have not been able to get this working.
Thanks in advance.