1

my question is similar to this SO Question. I am trying to explode the below dataframe using the pd.explode function. A close inspection of the length of the values in the list, I discovered that I have varying list lengths in the column. e.g row 4 has length 10 for TGR1 while TGR2 and TGR3 has lenght of 11. going through the pandas pandas doc, I discovered that pd.explode function can through a ValueError due to the following reasons.

  • If columns of the frame are not unique.
  • If specified columns to explode is empty list.
  • If specified columns to explode have not matching count of elements rowwise in the frame.

my dataframe violate rule no 3. my question what are the possible ways I can transform each element of a list-like to a row when my list has varying length?

below is my explode function, dataframe and length of the list across the dataframe.

df = (df.set_index(['index']).apply(pd.Series.explode).reset_index())

  index            TGR1                                                  TGR2                         TGR3
0   0   [1, 7, 5, 9, 1, 1, 1, 1, 4, 4, 9, 7]            [5, 8, 1, 1, 7, 5, 6, 8, 2, 7, 5, 8]             [4, 1, 8, 3, 2, 4, 2, 5, 1, 5, 1, 1]
1   1   [1, 8, 5, 8, 1, 6, 3, 1, 1, 8, 8, 6, 6, 1, 5]   [6, 7, 7, 1, 2, 9, 6, 6, 8, 5, 1, 1, 1, 7, 2]    [2, 2, 1, 2, 6, 8, 8, 8, 2, 2, 7, 4, 7, 3, 6]
2   2   [7, 2, 1, 1, 3, 8, 2, 1, 8, 4, 8, 7]            [4, 4, 6, 5, 2, 3, 7, 6, 7, 7, 4, 1]             [2, 8, 5, 1, 6, 6, 1, 1, 4, 1, 5, 3]
3   3   [8, 6, 8, 6, 8, 5, 1, 6, 4, 1, 4, 4]            [3, 8, 1, 2, 4, 6, 6, 1, 7, 2, 7, 2]             [1, 3, 6, 10, 3, 2, 3, 2, 5, 7, 3, 5]
4   4   [2, 1, 1, 3, 4, 2, 5, 3, 1, 4]                  [2, 8, 8, 7, 8, 9, 1, 6, 4, 8, 1]              [7, 7, 6, 4, 2, 6, 7, 2, 1, 2, 3]
5   5   [8, 8, 4, 8, 8, 1, 1, 4, 5, 5, 2, 1]            [2, 2, 5, 1, 6, 6, 6, 8, 4, 1, 6, 8]            [1, 1, 8, 6, 3, 3, 2, 2, 2, 4, 1, 7]



    TGR1  TGR2   TGR3
     12    12     12
1    15    15     15
2    12    12     12
3    12    12     12
4    10    11     11
5    12    12     12

Expected output should be of this form.

  TGR1 TGR2 TGR3
0   1   5   4
1   7   8   1
2   5   1   8
3   9   1   3
4   1   7   2
5   1   5   4
6   1   6   2
7   1   8   5
8   4   2   1
9   4   7   5
10  9   5   1
11  7   8   1
12  1   6   2
13  8   7   2
14  5   7   1
15  8   1   2
16  1   2   6
17  6   9   8
18  3   6   8
19  1   6   8
20  1   8   2
21  8   5   2
22  8   1   7
23  6   1   4
24  6   1   7
25  1   7   3
26  5   2   6
27  7   4   2
28  2   4   8
29  1   6   5
30  1   5   1
31  3   2   6
32  8   3   6
33  2   7   1
34  1   6   1
35  8   7   4
36  4   7   1
37  8   4   5
38  7   1   3
39  8   3   1
40  6   8   3
41  8   1   6
42  6   2   10
43  8   4   3
44  5   6   2
45  1   6   3
46  6   1   2
47  4   7   5
48  1   2   7
49  4   7   3
50  4   2   5
chuky pedro
  • 756
  • 1
  • 8
  • 26

1 Answers1

2

Try with stack and pivot:

stacked = df.stack().explode().reset_index()
stacked["uid"] = stacked.groupby(["level_0", "level_1"]).cumcount()
output = stacked.pivot(["level_0", "uid"], "level_1", 0).reset_index(drop=True).rename_axis(None, axis=1)

>>> output

        TGR1 TGR2 TGR3
0          1    5    4
1          7    8    1
2          5    1    8
3          9    1    3
4          1    7    2
..       ...  ...  ...
69         4    8    2
70         5    4    2
71         5    1    4
72         2    6    1
73         1    8    7

[74 rows x 3 columns]
not_speshal
  • 22,093
  • 2
  • 15
  • 30
  • thanks for you response, I was able to resolve the `ValueError` issue using the below function `(df.set_index('index').apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1))` – chuky pedro Oct 13 '21 at 19:08
  • `apply` is not vectorized and not recommended. It will slow down your code substantially for larger dataframes. See [this](https://stackoverflow.com/a/54432584/9857631) – not_speshal Oct 13 '21 at 19:21
  • thanks for the eye opener, I have reverted back to you solution – chuky pedro Oct 13 '21 at 20:12