pandas split set column in a duplicated row if set is bigger than len(x)

Question

I have a dataframe that looks like this:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                      ("j", "k", "l", "m")     "a2_data"
    2     "b1"       ("z", "y", "x", "w", "v", "u", "t")     "b1_data"

I need to split set_col whenever it holds more than 3 elements, putting each chunk of at most 3 elements in its own row that duplicates the other columns, resulting in this df:

index      key                                   set_col          data
    0     "a1"                                ("a", "b")     "a1_data"   
    1     "a2"                           ("j", "k", "l")     "a2_data"
    2     "a2"                                     ("m")     "a2_data"
    3     "b1"                           ("z", "y", "x")     "b1_data"
    4     "b1"                           ("w", "v", "u")     "b1_data"
    5     "b1"                                     ("t")     "b1_data"

I have read other answers using explode, replace or assign, like this or this, but neither handles splitting lists or sets into chunks of a fixed length and duplicating the rows.

In this answer I found the following code:

def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))

And I tried to apply it to the column like this:

df['split_set_col'] = df['set_col'].apply(split(df['set_col'], 3))

But I get the error:

pandas.errors.SpecificationError: nested renamer is not supported

Can you provide the constructor of the input? — mozway, Feb 09 '23 at 16:48

score 2 · Accepted Answer · answered Feb 09 '23 at 16:49

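Your function call is not right. Here split(df['set_col'], 3) is evaluated first, on the whole column, and its result is passed to apply instead of the function itself: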

df['set_col'].apply(split(df['set_col'], 3))

Replace with:

df['set_col'].apply(split, n=3)  # note the n=3 as named argument
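
To illustrate the calling convention, here is a minimal sketch of how Series.apply forwards extra keyword arguments to the function it is given. The helper first_n and the sample Series are hypothetical and only serve the demonstration:

```python
import pandas as pd

def first_n(values, n):
    # Hypothetical helper: keep the first n items of a tuple.
    return values[:n]

s = pd.Series([("j", "k", "l", "m"), ("a", "b")])

# Pass the function object itself; apply then calls first_n(value, n=3)
# once per element, forwarding n as a keyword argument.
trimmed = s.apply(first_n, n=3)
print(trimmed.tolist())  # [('j', 'k', 'l'), ('a', 'b')]
```

Writing s.apply(first_n(s, 3)) instead would call first_n once on the whole Series and hand its return value to apply, which is the mistake in the question.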

The function is also wrong for this task: it splits a into n parts, not into chunks of n elements. Use np.array_split instead:

import numpy as np

def split(a, n):
    return np.array_split(a, np.arange(0, len(a), n)[1:])

df['split_set_col'] = df['set_col'].apply(split, n=3)
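
As a rough illustration of what the helper does for one value, the sketch below prints the cut points that np.arange produces and the chunks np.array_split returns. The sample tuple is taken from the question; the variable names are only for illustration:

```python
import numpy as np

a = ("z", "y", "x", "w", "v", "u", "t")
n = 3

# Indices 0, 3, 6 are the chunk starts; dropping the leading 0 leaves
# the positions at which array_split should cut.
cut_points = np.arange(0, len(a), n)[1:]
print(cut_points.tolist())  # [3, 6]

# array_split converts the tuple to an array and cuts it at those positions.
chunks = np.array_split(a, cut_points)
print([c.tolist() for c in chunks])  # [['z', 'y', 'x'], ['w', 'v', 'u'], ['t']]

# A tuple shorter than n yields no cut points, so it comes back as one chunk.
short = np.array_split(("a", "b"), np.arange(0, 2, n)[1:])
print([c.tolist() for c in short])  # [['a', 'b']]
```

Note that the chunks are NumPy arrays rather than tuples, which is why the output shows them in square brackets.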

Output:

>>> df.explode('split_set_col', ignore_index=True)
    key                set_col       data split_set_col
0  "a1"                 (a, b)  "a1_data"        [a, b]
1  "a2"           (j, k, l, m)  "a2_data"     [j, k, l]
2  "a2"           (j, k, l, m)  "a2_data"           [m]
3  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [z, y, x]
4  "b1"  (z, y, x, w, v, u, t)  "b1_data"     [w, v, u]
5  "b1"  (z, y, x, w, v, u, t)  "b1_data"           [t]
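
If the goal is the exact frame from the question, with set_col itself replaced by tuple chunks and no extra column, a sketch along these lines might work. It assumes set_col holds tuples (the question never showed a constructor, so the DataFrame below is a guess), and chunk_tuple is a hypothetical helper that uses plain slicing instead of np.array_split so the chunks stay tuples:

```python
import pandas as pd

# Assumed constructor: tuples in set_col are a guess based on the printed frame.
df = pd.DataFrame({
    "key": ["a1", "a2", "b1"],
    "set_col": [
        ("a", "b"),
        ("j", "k", "l", "m"),
        ("z", "y", "x", "w", "v", "u", "t"),
    ],
    "data": ["a1_data", "a2_data", "b1_data"],
})

def chunk_tuple(values, size):
    # Hypothetical helper: consecutive slices of at most `size` items.
    return [values[i:i + size] for i in range(0, len(values), size)]

result = (
    # Replace each tuple with a list of tuple chunks ...
    df.assign(set_col=df["set_col"].apply(chunk_tuple, size=3))
    # ... then give every chunk its own row, duplicating key and data.
    .explode("set_col", ignore_index=True)
)
print(result)
```

Because explode only unpacks one level, each chunk stays a tuple, and ignore_index=True renumbers the rows 0 to 5 as in the desired output.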

Feel free to convert the list to a string, but the order will not be preserved. — Corralien, Feb 09 '23 at 16:53

score 0 · Answer 2 · answered Feb 09 '23 at 17:40

Here is an option, assuming your set_col column contains tuples:

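(df[['key','data']].join(
    df['set_col'].explode()  # one row per element, original index repeated
    .to_frame()
    # chunk id: position within the original row, integer-divided by 3
    .assign(cc = lambda x: x.groupby(level=0).cumcount().floordiv(3))
    .set_index('cc',append=True)
    .groupby(level=[0,1])['set_col']
    .agg(tuple)  # rebuild one tuple per (row, chunk) pair
    .droplevel(1)))  # drop the chunk id so the join aligns on the original index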

Output:

  key     data    set_col
0  a1  a1_data     (a, b)
1  a2  a2_data  (j, k, l)
1  a2  a2_data       (m,)
2  b1  b1_data  (z, y, x)
2  b1  b1_data  (w, v, u)
2  b1  b1_data       (t,)
