
I have a PySpark dataframe:

ids names
[1, 1, 2, 3, 1, 2, 3, 7, 5] [a, b, c, l, s, o, c, d, e]
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] [s, l, h, p, q, g, c, d, p, s]
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] [q, a, z, w, s, e, r, t, y, o, p, a, x]
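
For reference, a minimal sketch to reproduce this input (column names and values taken from the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ([1, 1, 2, 3, 1, 2, 3, 7, 5], ['a', 'b', 'c', 'l', 's', 'o', 'c', 'd', 'e']),
        ([3, 8, 9, 3, 9, 0, 0, 6, 7, 8], ['s', 'l', 'h', 'p', 'q', 'g', 'c', 'd', 'p', 's']),
        ([9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7], ['q', 'a', 'z', 'w', 's', 'e', 'r', 't', 'y', 'o', 'p', 'a', 'x']),
    ],
    ['ids', 'names'],
)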

I have two columns holding arrays of the same length. I want to split the first column's array (ids) on the value 7 (inclusive), e.g. [1,2,3,7,4,6,7] => [[1,2,3,7],[4,6,7]].

If there is only one 7, the split produces a single array, e.g. [1,2,3,4,7,8,0,5] => [[1,2,3,4,7]]; anything after the last 7 is of no use and is dropped.

The same split should be applied to the other column (names) at the same indices, producing segments of the same lengths, since each id is paired with a name. In other words, the names column must be split exactly where the ids column is.

Output should be:

ids names ids_splited names_splited
[1, 1, 2, 3, 1, 2, 3, 7, 5] [a, b, c, l, s, o, c, d, e] [[1, 1, 2, 3, 1, 2, 3, 7]] [[a, b, c, l, s, o, c, d]]
[3, 8, 9, 3, 9, 0, 0, 6, 7, 8] [s, l, h, p, q, g, c, d, p, s] [[3, 8, 9, 3, 9, 0, 0, 6, 7]] [[s, l, h, p, q, g, c, d, p]]
[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7] [q, a, z, w, s, e, r, t, y, o, p, a, x] [[9, 6, 5, 4, 7], [6, 5, 9, 2, 5, 5, 4, 7]] [[q, a, z, w, s], [e, r, t, y, o, p, a, x]]

I have tried many options but have not been able to resolve this.

Thanks in advance.

1 Answer


Quick solution with rdd + map

def split(r):
    A, B = [], []  # completed segments of ids and names
    a, b = [], []  # current, still-open segment

    # r is a Row of (ids, names); zip(*r) pairs each id with its name
    for x, y in zip(*r):
        if x != 7:
            a.append(x)
            b.append(y)
        else:
            # close the current segment, keeping the 7 (inclusive split)
            A.append([*a, x])
            B.append([*b, y])

            a, b = [], []  # reset for the next segment

    # leftovers after the last 7 are dropped; return original columns + splits
    return [*r, A, B]

result = df.rdd.map(split).toDF(['ids', 'names', 'ids_splited', 'names_splited'])

Result

+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
|ids                                    |names                                  |ids_splited                                |names_splited                              |
+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
|[1, 1, 2, 3, 1, 2, 3, 7, 5]            |[a, b, c, l, s, o, c, d, e]            |[[1, 1, 2, 3, 1, 2, 3, 7]]                 |[[a, b, c, l, s, o, c, d]]                 |
|[3, 8, 9, 3, 9, 0, 0, 6, 7, 8]         |[s, l, h, p, q, g, c, d, p, s]         |[[3, 8, 9, 3, 9, 0, 0, 6, 7]]              |[[s, l, h, p, q, g, c, d, p]]              |
|[9, 6, 5, 4, 7, 6, 5, 9, 2, 5, 5, 4, 7]|[q, a, z, w, s, e, r, t, y, o, p, a, x]|[[9, 6, 5, 4, 7], [6, 5, 9, 2, 5, 5, 4, 7]]|[[q, a, z, w, s], [e, r, t, y, o, p, a, x]]|
+---------------------------------------+---------------------------------------+-------------------------------------------+-------------------------------------------+
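
Note that `toDF` infers the result schema by sampling rows, which can fail if, say, `ids_splited` is empty on every sampled row. A hedged alternative, assuming the types shown in the example data, is to pass an explicit schema:

from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

schema = StructType([
    StructField('ids', ArrayType(IntegerType())),
    StructField('names', ArrayType(StringType())),
    StructField('ids_splited', ArrayType(ArrayType(IntegerType()))),
    StructField('names_splited', ArrayType(ArrayType(StringType()))),
])

# same split function as above, but with schema inference bypassed
result = spark.createDataFrame(df.rdd.map(split), schema)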
Shubham Sharma
  • That's awesome, a clear and crisp solution, thanks @Shubham. One thing: I don't have an index column, and I am applying your split function to only two of the columns, not all of them, so later I have no index column to join on. Can I apply the split function to just those columns while leaving the rest of the dataframe unaffected? – praveen kumar Mar 30 '23 at 11:30
  • Got it. I passed all the columns through, and while iterating in the for loop I kept every column but used only the required ones. It's working fine! – praveen kumar Mar 30 '23 at 11:36
  • Great! Yes, just use the two columns in zip, e.g. `zip(r['ids'], r['names'])` (see the sketch after this comment thread). – Shubham Sharma Mar 30 '23 at 11:38
  • [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), do you have any idea on this: https://stackoverflow.com/questions/76258564/pyspark-how-to-attach-the-new-columns-from-other-pyspark-dataframe-based-on-mul? – praveen kumar May 16 '23 at 04:57
  • @praveenkumar I have added the answer, please check. – Shubham Sharma May 16 '23 at 18:39
  • Yes [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), I am checking now. By the way, thanks a lot, brother; I'm learning from you. – praveen kumar May 19 '23 at 10:54
  • [Shubham](https://stackoverflow.com/users/12833166/shubham-sharma), I have a huge dataset, and working with the rdd API is slow compared to the DataFrame API; can you comment on this? – praveen kumar Jul 19 '23 at 09:09
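
Following up on the comment about using two columns in `zip`: below is a sketch of that variant, which splits only `ids` and `names` and passes every other column of `df` through unchanged (column names are the ones from the question):

def split(r):
    A, B = [], []
    a, b = [], []

    # zip only the two columns to split; all other columns in r pass through
    for x, y in zip(r['ids'], r['names']):
        if x != 7:
            a.append(x)
            b.append(y)
        else:
            A.append([*a, x])
            B.append([*b, y])

            a, b = [], []  # reset for the next segment
    return [*r, A, B]

result = df.rdd.map(split).toDF([*df.columns, 'ids_splited', 'names_splited'])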