If I have a DataFrame such that:

pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]]
              })

gives this structure:

           days  name
0  [1, 3, 5, 7]  John

How do I expand it to the following?

   days  name
0     1  John
1     3  John
2     5  John
3     7  John
– gozzilli
  • I don't fully understand why you want to do this. Is it because you have a dictionary like that and you want to turn it into a dataframe? And, in the column `name` you want to have the same value all the way through? – Joe T. Boka Jul 05 '16 at 13:51

8 Answers

You could use df.itertuples to iterate through each row, and use a list comprehension to reshape the data into the desired form:

import pandas as pd

df = pd.DataFrame( {"name" : ["John", "Eric"], 
               "days" : [[1, 3, 5, 7], [2,4]]})
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])
print(result)

yields

   0     1
0  1  John
1  3  John
2  5  John
3  7  John
4  2  Eric
5  4  Eric
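
If you want proper column names on the result rather than the default `0` and `1`, you can pass `columns=` to the same list comprehension (a small addition, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, 3, 5, 7], [2, 4]]})

# same list comprehension, with explicit column names
result = pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days],
                      columns=["days", "name"])
print(result)
```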

Divakar's solution, using_repeat, is fastest:

In [48]: %timeit using_repeat(df)
1000 loops, best of 3: 834 µs per loop

In [5]: %timeit using_itertuples(df)
100 loops, best of 3: 3.43 ms per loop

In [7]: %timeit using_apply(df)
1 loop, best of 3: 379 ms per loop

In [8]: %timeit using_append(df)
1 loop, best of 3: 3.59 s per loop

Here is the setup used for the above benchmark:

import numpy as np
import pandas as pd

N = 10**3
df = pd.DataFrame( {"name" : np.random.choice(list('ABCD'), size=N), 
                    "days" : [np.random.randint(10, size=np.random.randint(5))
                              for i in range(N)]})

def using_itertuples(df):
    return  pd.DataFrame([(d, tup.name) for tup in df.itertuples() for d in tup.days])

def using_repeat(df):
    lens = [len(item) for item in df['days']]
    return pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
                          "days" : np.concatenate(df['days'].values)})

def using_apply(df):
    return (df.apply(lambda x: pd.Series(x.days), axis=1)
            .stack()
            .reset_index(level=1, drop=1)
            .to_frame('day')
            .join(df['name']))

def using_append(df):
    df2 = pd.DataFrame(columns = df.columns)
    for i,r in df.iterrows():
        for e in r.days:
            new_r = r.copy()
            new_r.days = e
            df2 = df2.append(new_r)
    return df2
– unutbu
  • Bothering you, as I just modified mine to replace `np.concatenate` with `np.hstack`. Seems like it's a bit faster. Do you mind, updating the timings with it? :) – Divakar Jul 05 '16 at 13:10
  • @Divakar: I'm astonished. That's quite an improvement! – unutbu Jul 05 '16 at 13:38
  • Appreciate the updates! Well I am surprised to see np.hstack being faster than np.concatenate, as I was under the impression that all those hstack and vstacks are derived from np.concatenate. Maybe with pandas, it's doing optimizations? Not sure! – Divakar Jul 05 '16 at 13:43
  • `np.concatenate(df['days'].values)` seems to be even faster than `np.hstack(df['days'])`. – unutbu Jul 05 '16 at 13:48
  • Ah that's interesting and sort of makes sense I guess. So going through the route of NumPy array of obj dtype is better! Hmm. – Divakar Jul 05 '16 at 13:51
  • Amazing the speedup now, lovely! – Divakar Jul 05 '16 at 13:55
  • `np.concatenate(df['Days'])` is much slower than `np.concatenate(df['Days'].values)` because [under the hood](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L600) `np.concatenate` loops through the items in `df['Days']` using `PySequence_GetItem(df['Days'], iarrays)`, which is equivalent to `df['Days'][iarrays]`. This treats `iarray` as an index label, not an ordinal integer index. Looking up an index label is much slower than ordinal integer indexing. Performance disaster ensues. – unutbu Jul 05 '16 at 14:16
  • `np.hstack(df['Days'])` is pretty fast because this [converts `df['Days']` to a list](https://github.com/numpy/numpy/blob/master/numpy/core/shape_base.py#L283). So the indexing is again ordinal integer indexing. – unutbu Jul 05 '16 at 14:18
  • So, `np.concatenate` wasn't clever enough at it. Thanks for the insights there and digging in, appreciate it! – Divakar Jul 05 '16 at 14:19
Since pandas 0.25, you can use the explode() function:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html

import pandas as pd
df = pd.DataFrame( {"name" : "John", 
               "days" : [[1, 3, 5, 7]]})

print(df.explode('days'))

prints

   name days
0  John    1
0  John    3
0  John    5
0  John    7
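
Since pandas 1.1, `explode` also accepts `ignore_index=True` if you want the fresh `0..n-1` index shown in the question instead of the repeated original labels:

```python
import pandas as pd

df = pd.DataFrame({"name": "John",
                   "days": [[1, 3, 5, 7]]})

# ignore_index=True resets the result index instead of repeating 0
print(df.explode("days", ignore_index=True))
```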
– philshem

Here's something with NumPy -

lens = [len(item) for item in df['days']]
df_out = pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
               "days" : np.hstack(df['days'])
              })

As pointed out in @unutbu's solution, np.concatenate(df['days'].values) would be faster than np.hstack(df['days']).

A list comprehension extracts the length of each 'days' element, which should add minimal runtime overhead.

Sample run -

>>> df
           days  name
0  [1, 3, 5, 7]  John
1        [2, 4]  Eric
>>> lens = [len(item) for item in df['days']]
>>> pd.DataFrame( {"name" : np.repeat(df['name'].values,lens), 
...                "days" : np.hstack(df['days'])
...               })
   days  name
0     1  John
1     3  John
2     5  John
3     7  John
4     2  Eric
5     4  Eric
– Divakar
A 'native' pandas solution - we unstack the column into a series, then join back based on the index:

import pandas as pd

x2 = x.days.apply(pd.Series).unstack()  # one column per list element, then unstack into a long Series
x.drop('days', axis=1).join(pd.DataFrame(x2.reset_index(level=0, drop=True)))  # drop the list column, join the values back on the row index
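
For reference, a self-contained version of the same approach, using the question's single-row frame (the variable names are the answer's, the surrounding setup is mine):

```python
import pandas as pd

x = pd.DataFrame({"name": "John", "days": [[1, 3, 5, 7]]})

x2 = x["days"].apply(pd.Series).unstack()              # one column per element, stacked long
out = x.drop("days", axis=1).join(
    pd.DataFrame(x2.reset_index(level=0, drop=True)))  # align values back on the row index
print(out)
```

Note the joined column keeps the default name `0`; rename it afterwards if needed.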
– jeremycg
Another solution:

In [139]: (df.apply(lambda x: pd.Series(x.days), axis=1)
   .....:    .stack()
   .....:    .reset_index(level=1, drop=1)
   .....:    .to_frame('day')
   .....:    .join(df['name'])
   .....: )
Out[139]:
   day  name
0    1  John
0    3  John
0    5  John
0    7  John
– MaxU - stand with Ukraine
  • Fantastic answer! A slightly different need: what could be modified to leave NaN values to the frame when the arrays are like `[val, None, None, val2, None ...]` and they differ in size? – Felix Aug 04 '18 at 07:46
  • Self-reply: first replace the `None`s with some placeholder so pandas won't freak out. Then remove the `reset_index`. Move join before stack and optionally set the name-field as index before stacking to avoid repeating it. Afterwards replace placeholders with NaN. – Felix Aug 04 '18 at 08:15
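
A sketch of what Felix describes in the comments above, assuming `-999` is a placeholder value that never occurs in the real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["John", "Eric"],
                   "days": [[1, None, 5], [2, None]]})

SENTINEL = -999  # assumed absent from the real data

result = (df["days"]
          .apply(lambda lst: pd.Series([SENTINEL if v is None else v for v in lst]))
          .stack()                        # drops the padding NaN from ragged rows,
                                          # but keeps the sentinel values
          .reset_index(level=1, drop=True)
          .to_frame("day")
          .join(df["name"])
          .replace(SENTINEL, np.nan))     # restore the NaNs afterwards
print(result)
```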
Probably something like this:

df2 = pd.DataFrame(columns = df.columns)
for i,r in df.iterrows():
    for e in r.days:
        new_r = r.copy()
        new_r.days = e
        df2 = df2.append(new_r)
df2
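
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so on modern pandas the same row-by-row idea has to collect the rows first (a sketch, not the original answer):

```python
import pandas as pd

df = pd.DataFrame({"name": "John", "days": [[1, 3, 5, 7]]})

rows = []
for i, r in df.iterrows():
    for e in r.days:
        new_r = r.copy()
        new_r.days = e        # replace the list with a single element
        rows.append(new_r)
df2 = pd.DataFrame(rows).reset_index(drop=True)
print(df2)
```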
– AnnetteC
Building on Divakar's solution, here is a wrapper function that flattens a column, handling np.nan and DataFrames with multiple columns:

def flatten_column(df, column_name):
    # repeat each row once per list element (treat NaN as a single element)
    repeat_lens = [len(item) if item is not np.nan else 1 for item in df[column_name]]
    df_columns = list(df.columns)
    df_columns.remove(column_name)
    expanded_df = pd.DataFrame(np.repeat(df.drop(column_name, axis=1).values,
                                         repeat_lens, axis=0), columns=df_columns)
    expanded_df[column_name] = np.hstack(df[column_name].values)
    expanded_df[column_name].replace('nan', np.nan, inplace=True)
    return expanded_df
– Gautham Kumaran
In case you ended up here, searching for a solution with multiple columns:

import pandas as pd
df = pd.DataFrame( {"name" : "John",
               "days" : [[1, 3, 5, 7]],
               "values": [[10, 20, 30, 40]]
              })
print(df)

           days  name            values
0  [1, 3, 5, 7]  John  [10, 20, 30, 40]

the command:

print(df.explode(['days', 'values']))

will give the output:

   name days values
0  John    1     10
0  John    3     20
0  John    5     30
0  John    7     40

Note that the "expanded" columns must have the same length.
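
If the lengths don't match, multi-column explode (supported since pandas 1.3) raises a ValueError rather than padding; a quick check:

```python
import pandas as pd

df = pd.DataFrame({"name": "John",
                   "days": [[1, 3, 5, 7]],
                   "values": [[10, 20]]})   # mismatched lengths

try:
    df.explode(["days", "values"])
except ValueError as err:
    print("refused:", err)
```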

– PV8