112

I'm looking to turn a pandas cell containing a list into rows for each of those values.

So, take this:

                                                    nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

If I'd like to unpack and stack the values in the nearest_neighbors column so that each value would be a row within each opponent index, how would I best go about this? Are there pandas methods that are meant for operations like this?

Pika Supports Ukraine
SpicyClubSauce
    For most cases, the correct answer is to now use [`pandas.DataFrame.explode()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) as shown in this [answer](https://stackoverflow.com/a/57105840/7758804), or [`pandas.Series.explode`](https://pandas.pydata.org/docs/reference/api/pandas.Series.explode.html). – Trenton McKinney May 15 '21 at 21:21
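
A minimal sketch of the Series.explode variant the comment mentions (assuming pandas 0.25+ and a list column named nearest_neighbors, as in the answers below):

# explode a single list-valued Series; the index is repeated for each list element
df['nearest_neighbors'].explode()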

11 Answers

69

Exploding a list-like column has been simplified significantly in pandas 0.25 with the addition of the explode() method:

import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

df.explode('nearest_neighbors')

Out:

                    nearest_neighbors
name       opponent                  
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia
joelostblom
    Note that this only works for a single column (as of 0.25). See [here](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe?r=SearchResults&s=3|49.3211) and [here](https://stackoverflow.com/a/50731254/4909087) for more generic solutions. – cs95 Jul 20 '19 at 07:14
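
For reference, newer pandas releases (1.3 and later) also accept a list of column names in DataFrame.explode, provided the lists in each row have matching lengths. A minimal sketch with made-up columns:

import pandas as pd

# pandas >= 1.3: explode several aligned list columns at once
df2 = pd.DataFrame({'a': [[1, 2], [3, 4]], 'b': [['x', 'y'], ['z', 'w']]})
df2.explode(['a', 'b'])
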
57

In the code below, I first reset the index to make the row iteration easier.

I build a list of lists where each element of the outer list is a row of the target DataFrame and each element of the inner list is one of the columns. This nested list is then used to construct the desired DataFrame.

I use a lambda function together with a list comprehension to create a row for each element of nearest_neighbors, paired with the relevant name and opponent.

Finally, I create a new DataFrame from this list (using the original column names and setting the index back to name and opponent).

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

>>> df
                                                    nearest_neighbors
name       opponent                                                  
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

df.reset_index(inplace=True)
rows = []
_ = df.apply(lambda row: [rows.append([row['name'], row['opponent'], nn]) 
                         for nn in row.nearest_neighbors], axis=1)
df_new = pd.DataFrame(rows, columns=df.columns).set_index(['name', 'opponent'])

>>> df_new
                    nearest_neighbors
name       opponent                  
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

EDIT JUNE 2017

An alternative method (starting again from the original MultiIndexed df, before the reset_index above) is as follows:

>>> (pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(), 
             id_vars=['name', 'opponent'],
             value_name='nearest_neighbors')
     .set_index(['name', 'opponent'])
     .drop('variable', axis=1)
     .dropna()
     .sort_index()
     )
Alexander
36

Use apply(pd.Series) and stack, then reset_index and to_frame

In [1803]: (df.nearest_neighbors.apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))
Out[1803]:
                    nearest_neighbors
name       opponent
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

Details

In [1804]: df
Out[1804]:
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
Zero
17

I think this is a really good question; in Hive you would use EXPLODE, and I think there is a case to be made that pandas should include this functionality by default. I would probably explode the list column with a nested generator expression like this:

pd.DataFrame(
    {"name": i[0],
     "opponent": i[1],
     "nearest_neighbor": neighbour}
    for i, row in df.iterrows()
    for neighbour in row.nearest_neighbors
).set_index(["name", "opponent"])
maxymoo
12

The fastest method I have found so far is to replicate the rows with .iloc and then assign the flattened target column back.

Given the usual input (replicated a bit):

import numpy as np
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
df = pd.concat([df]*10)

df
Out[3]: 
                                                   nearest_neighbors
name       opponent                                                 
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
...

Given the following suggested alternatives:

col_target = 'nearest_neighbors'

def extend_iloc():
    # Flatten columns of lists
    col_flat = [item for sublist in df[col_target] for item in sublist] 
    # Row numbers to repeat 
    lens = df[col_target].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    cols = [i for i,c in enumerate(df.columns) if c != col_target]
    new_df = df.iloc[ilocations, cols].copy()
    new_df[col_target] = col_flat
    return new_df

def melt():
    return (pd.melt(df[col_target].apply(pd.Series).reset_index(), 
             id_vars=['name', 'opponent'],
             value_name=col_target)
            .set_index(['name', 'opponent'])
            .drop('variable', axis=1)
            .dropna()
            .sort_index())

def stack_unstack():
    return (df[col_target].apply(pd.Series)
            .stack()
            .reset_index(level=2, drop=True)
            .to_frame(col_target))

I find that extend_iloc() is the fastest:

%timeit extend_iloc()
3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit melt()
22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit stack_unstack()
11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Oleg
  • nice evaluation – WestCoastProjects Mar 05 '18 at 00:49
    Thanks for this, it really helped me. I used the extend_iloc solution and found that `cols = [c for c in df.columns if c != col_target]` should be: `cols = [i for i,c in enumerate(df.columns) if c != col_target]` The `df.iloc[ilocations, cols].copy()` errors if not presented with the column index. – jdungan Sep 03 '18 at 21:28
  • Thanks again for the iloc suggestion. I wrote a detailed explanation of how it works here: https://medium.com/@johnadungan/expanding-lists-in-panda-dataframes-2724803498f8. Hope it helps anyone with a similar challenge. – jdungan Jan 17 '19 at 17:52
7

A nicer alternative with apply(pd.Series), which expands each list into its own set of columns (rather than rows):

import pandas as pd

df = pd.DataFrame({'listcol': [[1, 2, 3], [4, 5, 6]]})

# expand df.listcol into its own dataframe (one column per list position)
tags = df['listcol'].apply(pd.Series)

# rename each new column with a listcol_ prefix
tags = tags.rename(columns=lambda x: 'listcol_' + str(x))

# join the tags dataframe back to the original dataframe
df = pd.concat([df, tags], axis=1)
Philipp Schwarz
7

Similar to Hive's EXPLODE functionality:

import copy
import pandas

def pandas_explode(df, column_to_explode):
    """
    Similar to Hive's EXPLODE function: take a column with iterable elements and flatten the iterable to one element
    per observation in the output table.

    :param df: A dataframe to explode
    :type df: pandas.DataFrame
    :param column_to_explode: Name of the column to explode
    :type column_to_explode: str
    :return: An exploded data frame
    :rtype: pandas.DataFrame
    """

    # Create a list of new observations
    new_observations = list()

    # Iterate through existing observations
    for row in df.to_dict(orient='records'):

        # Take out the exploding iterable
        explode_values = row[column_to_explode]
        del row[column_to_explode]

        # Create a new observation for every entry in the exploding iterable & add all of the other columns
        for explode_value in explode_values:

            # Deep copy existing observation
            new_observation = copy.deepcopy(row)

            # Add one (newly flattened) value from exploding iterable
            new_observation[column_to_explode] = explode_value

            # Add to the list of new observations
            new_observations.append(new_observation)

    # Create a DataFrame
    return_df = pandas.DataFrame(new_observations)

    # Return
    return return_df
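
A hypothetical usage sketch (assuming the df built in the other answers, with the index reset so name and opponent are regular columns):

exploded = pandas_explode(df.reset_index(), 'nearest_neighbors')
exploded = exploded.set_index(['name', 'opponent'])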
Xenolion
13Herger
5

So all of these answers are good, but I wanted something *really simple*, so here's my contribution:

def explode(series):
    return pd.Series([x for inner_list in series for x in inner_list])                               

That's it. Just use this when you want a new Series in which the lists are 'exploded'. Here's an example where we do value_counts():

In[1]: df = pd.DataFrame({'column': [['a','b','c'],['b','c'],['c']]})
In [2]: df
Out[2]:
      column
0  [a, b, c]
1     [b, c]
2        [c]

In [3]: explode(df['column'])
Out[3]:
0    a
1    b
2    c
3    b
4    c
5    c

In [4]: explode(df['column']).value_counts()
Out[4]:
c    3
b    2
a    1
Brian Wylie
2

Here is a potential optimization for larger dataframes: it runs faster when the "exploding" field contains many repeated values. (The larger the dataframe is relative to the number of unique values in that field, the better this code performs.)

def lateral_explode(dataframe, fieldname):
    # Work on a hashable (tuple) copy of the list column so it can be used as a merge key
    temp_fieldname = fieldname + '_made_tuple_'
    dataframe[temp_fieldname] = dataframe[fieldname].apply(tuple)
    # Build one small mapping frame per unique tuple: tuple -> its individual elements
    list_of_dataframes = []
    for values in dataframe[temp_fieldname].unique().tolist():
        list_of_dataframes.append(pd.DataFrame({
            temp_fieldname: [values] * len(values),
            fieldname: list(values),
        }))
    # Merge the mapping back in, which replicates each original row once per element
    dataframe = (dataframe[list(set(dataframe.columns) - set([fieldname]))]
                 .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname))
    del dataframe[temp_fieldname]

    return dataframe
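
A hypothetical usage sketch with the df from the earlier answers (index reset first so nearest_neighbors sits alongside name and opponent as regular columns):

exploded = lateral_explode(df.reset_index(), 'nearest_neighbors')
exploded.set_index(['name', 'opponent']).sort_index()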
Sinan Ozel
1

Extending Oleg's .iloc answer to automatically flatten all list-columns:

import numpy as np

def extend_iloc(df):
    # Columns whose first value is a list are treated as list-columns to flatten
    cols_to_flatten = [colname for colname in df.columns
                       if isinstance(df.iloc[0][colname], list)]
    # Row numbers to repeat
    lens = df[cols_to_flatten[0]].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows, keeping only the non-list columns
    with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
    col_idxs = [i for i, _ in with_idxs]
    new_df = df.iloc[ilocations, col_idxs].copy()

    # Flatten each column of lists and assign it back
    for col_target in cols_to_flatten:
        col_flat = [item for sublist in df[col_target] for item in sublist]
        new_df[col_target] = col_flat

    return new_df

This assumes that, in each row, all of the list-columns contain lists of the same length.
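
A hypothetical usage sketch (made-up frame with two aligned list-columns, not from the original answer):

import pandas as pd

df2 = pd.DataFrame({'id': [1, 2],
                    'tags': [['a', 'b'], ['c']],
                    'scores': [[10, 20], [30]]})
extend_iloc(df2)
#    id tags  scores
# 0   1    a      10
# 0   1    b      20
# 1   2    c      30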

1

Instead of using apply(pd.Series), you can build the intermediate frame directly from the column's list values with .values.tolist(). This improves performance.

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))



%timeit (pd.DataFrame(df['nearest_neighbors'].values.tolist(), index = df.index)
           .stack()
           .reset_index(level = 2, drop=True).to_frame('nearest_neighbors'))

1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%timeit (df.nearest_neighbors.apply(pd.Series)
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))

2.73 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)