pd.Series.explode results to a ValueError: cannot reindex from a duplicate axis

Question

my question is similar to this SO Question. I am trying to explode the below dataframe using the pd.explode function. A close inspection of the length of the values in the list, I discovered that I have varying list lengths in the column. e.g row 4 has length 10 for TGR1 while TGR2 and TGR3 has lenght of 11. going through the pandas pandas doc, I discovered that pd.explode function can through a ValueError due to the following reasons.

If columns of the frame are not unique.
If specified columns to explode is empty list.
If specified columns to explode have not matching count of elements rowwise in the frame.

my dataframe violate rule no 3. my question what are the possible ways I can transform each element of a list-like to a row when my list has varying length?

below is my explode function, dataframe and length of the list across the dataframe.

df = (df.set_index(['index']).apply(pd.Series.explode).reset_index())

  index            TGR1                                                  TGR2                         TGR3
0   0   [1, 7, 5, 9, 1, 1, 1, 1, 4, 4, 9, 7]            [5, 8, 1, 1, 7, 5, 6, 8, 2, 7, 5, 8]             [4, 1, 8, 3, 2, 4, 2, 5, 1, 5, 1, 1]
1   1   [1, 8, 5, 8, 1, 6, 3, 1, 1, 8, 8, 6, 6, 1, 5]   [6, 7, 7, 1, 2, 9, 6, 6, 8, 5, 1, 1, 1, 7, 2]    [2, 2, 1, 2, 6, 8, 8, 8, 2, 2, 7, 4, 7, 3, 6]
2   2   [7, 2, 1, 1, 3, 8, 2, 1, 8, 4, 8, 7]            [4, 4, 6, 5, 2, 3, 7, 6, 7, 7, 4, 1]             [2, 8, 5, 1, 6, 6, 1, 1, 4, 1, 5, 3]
3   3   [8, 6, 8, 6, 8, 5, 1, 6, 4, 1, 4, 4]            [3, 8, 1, 2, 4, 6, 6, 1, 7, 2, 7, 2]             [1, 3, 6, 10, 3, 2, 3, 2, 5, 7, 3, 5]
4   4   [2, 1, 1, 3, 4, 2, 5, 3, 1, 4]                  [2, 8, 8, 7, 8, 9, 1, 6, 4, 8, 1]              [7, 7, 6, 4, 2, 6, 7, 2, 1, 2, 3]
5   5   [8, 8, 4, 8, 8, 1, 1, 4, 5, 5, 2, 1]            [2, 2, 5, 1, 6, 6, 6, 8, 4, 1, 6, 8]            [1, 1, 8, 6, 3, 3, 2, 2, 2, 4, 1, 7]



    TGR1  TGR2   TGR3
     12    12     12
1    15    15     15
2    12    12     12
3    12    12     12
4    10    11     11
5    12    12     12

Expected output should be of this form.

  TGR1 TGR2 TGR3
0   1   5   4
1   7   8   1
2   5   1   8
3   9   1   3
4   1   7   2
5   1   5   4
6   1   6   2
7   1   8   5
8   4   2   1
9   4   7   5
10  9   5   1
11  7   8   1
12  1   6   2
13  8   7   2
14  5   7   1
15  8   1   2
16  1   2   6
17  6   9   8
18  3   6   8
19  1   6   8
20  1   8   2
21  8   5   2
22  8   1   7
23  6   1   4
24  6   1   7
25  1   7   3
26  5   2   6
27  7   4   2
28  2   4   8
29  1   6   5
30  1   5   1
31  3   2   6
32  8   3   6
33  2   7   1
34  1   6   1
35  8   7   4
36  4   7   1
37  8   4   5
38  7   1   3
39  8   3   1
40  6   8   3
41  8   1   6
42  6   2   10
43  8   4   3
44  5   6   2
45  1   6   3
46  6   1   2
47  4   7   5
48  1   2   7
49  4   7   3
50  4   2   5

@not_speshal I have updated the question with my expected output — chuky pedro, Oct 13 '21 at 17:55

score 2 · Answer 1 · answered Oct 13 '21 at 18:17

2

Try with stack and pivot:

stacked = df.stack().explode().reset_index()
stacked["uid"] = stacked.groupby(["level_0", "level_1"]).cumcount()
output = stacked.pivot(["level_0", "uid"], "level_1", 0).reset_index(drop=True).rename_axis(None, axis=1)

>>> output

        TGR1 TGR2 TGR3
0          1    5    4
1          7    8    1
2          5    1    8
3          9    1    3
4          1    7    2
..       ...  ...  ...
69         4    8    2
70         5    4    2
71         5    1    4
72         2    6    1
73         1    8    7

[74 rows x 3 columns]

answered Oct 13 '21 at 18:17

not_speshal

22,093
2
15
30

thanks for you response, I was able to resolve the `ValueError` issue using the below function `(df.set_index('index').apply(lambda x: x.apply(pd.Series).stack()).reset_index().drop('level_1', 1))` – chuky pedro Oct 13 '21 at 19:08
`apply` is not vectorized and not recommended. It will slow down your code substantially for larger dataframes. See [this](https://stackoverflow.com/a/54432584/9857631) – not_speshal Oct 13 '21 at 19:21
thanks for the eye opener, I have reverted back to you solution – chuky pedro Oct 13 '21 at 20:12

pd.Series.explode results to a ValueError: cannot reindex from a duplicate axis

1 Answers1