8

I want to expand the list in a certain column (in the example column_x) to multiple rows.

So

df = pd.DataFrame({'column_a': ['a_1', 'a_2'], 
                   'column_b': ['b_1', 'b_2'], 
                   'column_x': [['c_1', 'c_2'], ['d_1', 'd_2']]
                  })

shall be transformed from

    column_a    column_b    column_x
0   a_1         b_1         [c_1, c_2]
1   a_2         b_2         [d_1, d_2]

to

    column_a    column_b    column_x
0   a_1         b_1         c_1
1   a_1         b_1         c_2
2   a_2         b_2         d_1
3   a_2         b_2         d_2

The code I have so far does exactly this, and it does it fast.

lens = [len(item) for item in df['column_x']]
pd.DataFrame( {"column_a" : np.repeat(df['column_a'].values, lens), 
               "column_b" : np.repeat(df['column_b'].values, lens), 
               "column_x" : np.concatenate(df['column_x'].values)})

However, I have lots of columns. Is there a neat and elegant solution for repeating the whole data frame without specifying each column again?

cs95
  • 379,657
  • 97
  • 704
  • 746
Michael Dorner
  • 17,587
  • 13
  • 87
  • 117
  • 1
    I think here is main problem if all columns have same `dtypes`. Because if use `df = pd.DataFrame({'column_a': [1, 2], 'column_b': ['b_1', 'b_2'], 'column_x': [['c_1', 'c_2'], ['d_1', 'd_2']] })` then both solutions get different outputs - check it by `print (df1.dtypes)` vs `print (df.dtypes)`. Here is main problem numpy cast all dtypes to same, so more general solution is repeat `index` and if same `dtypes` of all column is possible use `cᴏʟᴅsᴘᴇᴇᴅ` solution too. It depends of data. – jezrael Mar 07 '18 at 12:37
  • This is an important remark! Thanks a lot. – Michael Dorner Mar 07 '18 at 15:12

2 Answers2

8

Pandas >= 0.25

Pandas can do this in a single function call via df.explode.

df.explode('column_x')

  column_a column_b column_x
0      a_1      b_1      c_1
0      a_1      b_1      c_2
1      a_2      b_2      d_1
1      a_2      b_2      d_2

Note that you can only explode a Series/DataFrame on one column.


Pandas < 0.25

Call np.repeat along the 0th axis for every column besides column_x.

df1 = pd.DataFrame(
    df.drop('column_x', 1).values.repeat(df['column_x'].str.len(), axis=0),
    columns=df.columns.difference(['column_x'])
)
df1['column_x'] = np.concatenate(df['column_x'].values)

df1

  column_a column_b column_x
0      a_1      b_1      c_1
1      a_1      b_1      c_2
2      a_2      b_2      d_1
3      a_2      b_2      d_2
Community
  • 1
  • 1
cs95
  • 379,657
  • 97
  • 704
  • 746
2

You can repeat index values:

lens = df['column_x'].str.len()
a = np.repeat(df.index.values, lens)
print (a)
[0 0 1 1]

df = df.loc[a].assign(column_x=np.concatenate(df['column_x'].values)).reset_index(drop=True)
print (df)
  column_a column_b column_x
0      a_1      b_1      c_1
1      a_1      b_1      c_2
2      a_2      b_2      d_1
3      a_2      b_2      d_2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Repeating index values is a pain, isn't it? Makes cell coords non-unique. – smci Mar 07 '18 at 09:16
  • @smci - You are right, but if need not change all dtypes to strings it is good solution – jezrael Mar 07 '18 at 10:19
  • jezrael there's no need to: `df.reset_index(inplace=True, drop=True)` will reindex the df, with integers. (Mind you it will invalidate any existing variables storing indices). – smci Mar 08 '18 at 22:37
  • From a performance perspective: This solution is almost double as fast as the solution of @cᴏʟᴅsᴘᴇᴇᴅ. – Michael Dorner Mar 09 '18 at 14:59