5

Let's say I have a data frame df:

   C1      C2      C3      C4               C5
0  [A]     [1]     s1      [123]            t1  
1  [A]     [1]     s2      321              t2
2  [A,B]   [1,2]   s3      [777,111]        t3
3  [B]     [2]     s4      145              t4
4  [B]     [2]     s5      [990]            t5
5  [A,B,B] [1,2,2] s6      [124,125,765]    t6
6  [A,A]   [1,3]   s7      119              t7

I want to explode everything out, so I have been doing

df = df.apply(pd.Series.explode)

However, this gives me ValueError: cannot reindex from a duplicate axis. I have traced the culprit to the row 6 (last row) of df. I understood when I got this before when I had things in C1 that were not the same length as what was in C2. But I don't understand what's wrong with exploding that last row.

If I do pd.DataFrame([[['A','B'],[1,2],'s7',119,'t7']]).apply(pd.Series.explode(), it works fine and as expected giving me the following:

   C1      C2      C3      C4         C5
0  A        1      s7      119        t7  
1  A        3      s7      119        t7

I can't figure out why that last row causes an error when it is part of the whole data frame. I have check the index and it is all unique.

formicaman
  • 1,317
  • 3
  • 16
  • 32
  • Does this answer your question? [Efficient way to unnest (explode) multiple list columns in a pandas DataFrame](https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe) – sushanth Jul 28 '20 at 17:03
  • This doesn't answer my question, but this is where is got `apply(pd.Series.explode)`. I don't know why it's not working on that last row. – formicaman Jul 28 '20 at 17:05
  • @formicaman can you give the first dataframe in a way easier to duplicates? like with `df.head(7).to_dict()`? – Ben.T Jul 28 '20 at 18:00
  • Evidently every list must be the same length to use `.apply(pd.Series.explode)` -- more details here: https://stackoverflow.com/a/59330040/13608599 – jsmart Jul 28 '20 at 18:52

1 Answers1

5

Cleanup as you go and it works.

df = pd.DataFrame({'C1': [["A"], ["A"], ["A","B"], ["B"], ["B"], ["A","B","B"], ["A","A"]],
 'C2': [[1], [1], [1,2], [2], [2], [1,2,2], [1,3]],
 'C3': ['s1', 's2', 's3', 's4', 's5', 's6', 's7'],
 'C4': [[123], 321, [777,111], 145, [990], [124,125,765], 119],
 'C5': ['t1', 't2', 't3', 't4', 't5', 't6', 't7']})

df.explode("C1").reset_index().drop("index",1).explode("C2").reset_index()\
    .drop("index",1).explode("C4").reset_index().drop("index",1)


output

  C1 C2  C3   C4  C5
0  A  1  s1  123  t1
1  A  1  s2  321  t2
2  A  1  s3  777  t3
3  A  1  s3  111  t3
4  A  2  s3  777  t3
5  A  2  s3  111  t3
6  B  1  s3  777  t3
7  B  1  s3  111  t3
8  B  2  s3  777  t3
9  B  2  s3  111  t3
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30