Explode column of list to multiple rows

Question

I want to expand the list in a certain column (in the example column_x) to multiple rows.

So

df = pd.DataFrame({'column_a': ['a_1', 'a_2'], 
                   'column_b': ['b_1', 'b_2'], 
                   'column_x': [['c_1', 'c_2'], ['d_1', 'd_2']]
                  })

shall be transformed from

    column_a    column_b    column_x
0   a_1         b_1         [c_1, c_2]
1   a_2         b_2         [d_1, d_2]

to

    column_a    column_b    column_x
0   a_1         b_1         c_1
1   a_1         b_1         c_2
2   a_2         b_2         d_1
3   a_2         b_2         d_2

The code I have so far does exactly this, and it does it fast.

lens = [len(item) for item in df['column_x']]
pd.DataFrame( {"column_a" : np.repeat(df['column_a'].values, lens), 
               "column_b" : np.repeat(df['column_b'].values, lens), 
               "column_x" : np.concatenate(df['column_x'].values)})

However, I have lots of columns. Is there a neat and elegant solution for repeating the whole data frame without specifying each column again?

I think here is main problem if all columns have same `dtypes`. Because if use `df = pd.DataFrame({'column_a': [1, 2], 'column_b': ['b_1', 'b_2'], 'column_x': [['c_1', 'c_2'], ['d_1', 'd_2']] })` then both solutions get different outputs - check it by `print (df1.dtypes)` vs `print (df.dtypes)`. Here is main problem numpy cast all dtypes to same, so more general solution is repeat `index` and if same `dtypes` of all column is possible use `cᴏʟᴅsᴘᴇᴇᴅ` solution too. It depends of data. — jezrael, Mar 07 '18 at 12:37

score 8 · Answer 1 · edited Jun 20 '20 at 09:12

Pandas >= 0.25

Pandas can do this in a single function call via df.explode.

df.explode('column_x')

  column_a column_b column_x
0      a_1      b_1      c_1
0      a_1      b_1      c_2
1      a_2      b_2      d_1
1      a_2      b_2      d_2

Note that you can only explode a Series/DataFrame on one column.

Pandas < 0.25

Call np.repeat along the 0^th axis for every column besides column_x.

df1 = pd.DataFrame(
    df.drop('column_x', 1).values.repeat(df['column_x'].str.len(), axis=0),
    columns=df.columns.difference(['column_x'])
)
df1['column_x'] = np.concatenate(df['column_x'].values)

df1

  column_a column_b column_x
0      a_1      b_1      c_1
1      a_1      b_1      c_2
2      a_2      b_2      d_1
3      a_2      b_2      d_2

score 2 · Accepted Answer · answered Mar 07 '18 at 09:14

2

You can repeat index values:

lens = df['column_x'].str.len()
a = np.repeat(df.index.values, lens)
print (a)
[0 0 1 1]

df = df.loc[a].assign(column_x=np.concatenate(df['column_x'].values)).reset_index(drop=True)
print (df)
  column_a column_b column_x
0      a_1      b_1      c_1
1      a_1      b_1      c_2
2      a_2      b_2      d_1
3      a_2      b_2      d_2

answered Mar 07 '18 at 09:14

jezrael

822,522
95
1,334
1,252

Repeating index values is a pain, isn't it? Makes cell coords non-unique. – smci Mar 07 '18 at 09:16
@smci - You are right, but if need not change all dtypes to strings it is good solution – jezrael Mar 07 '18 at 10:19
jezrael there's no need to: `df.reset_index(inplace=True, drop=True)` will reindex the df, with integers. (Mind you it will invalidate any existing variables storing indices). – smci Mar 08 '18 at 22:37
From a performance perspective: This solution is almost double as fast as the solution of @cᴏʟᴅsᴘᴇᴇᴅ. – Michael Dorner Mar 09 '18 at 14:59

Explode column of list to multiple rows

2 Answers2

Pandas >= 0.25

Pandas < 0.25

Linked