362

I have a pandas DataFrame in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume the CSV fields are clean and need only be split on ','). For example, a should become b:

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
Out[8]: 
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

So far I have tried various simple functions, but the .apply method seems to accept only one row as a return value when used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!

Example data:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')
z.transform(fun)
Vincent
  • Other solutions on this page work, but I found the following one short and effective: https://stackoverflow.com/questions/27263805/pandas-when-cell-contents-are-lists-create-a-row-for-each-element-in-the-list – desaiankitb Nov 28 '18 at 07:33
  • For others arriving at this page and looking for a solution that keeps multiple columns, have a look at this question: https://stackoverflow.com/questions/17116814/pandas-how-do-i-split-text-in-a-column-into-multiple-rows – Sos Dec 14 '18 at 16:17

27 Answers

245

UPDATE 3: it makes more sense to use the Series.explode() / DataFrame.explode() methods (implemented in pandas 0.25.0 and extended in pandas 1.3.0 to support multi-column explode), as shown in the usage examples below:

for a single column:

In [1]: df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
   ...:                    'B': 1,
   ...:                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})

In [2]: df
Out[2]:
           A  B          C
0  [0, 1, 2]  1  [a, b, c]
1        foo  1        NaN
2         []  1         []
3     [3, 4]  1     [d, e]

In [3]: df.explode('A')
Out[3]:
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]

for multiple columns (for Pandas 1.3.0+):

In [4]: df.explode(['A', 'C'])
Out[4]:
     A  B    C
0    0  1    a
0    1  1    b
0    2  1    c
1  foo  1  NaN
2  NaN  1  NaN
3    3  1    d
3    4  1    e

UPDATE 2: a more generic vectorized function, which works for multiple regular columns and multiple list columns

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty; DataFrame.append was removed
        # in pandas 2.0, so use pd.concat instead
        res = (pd.concat([res, df.loc[lens == 0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res

Demo:

Multiple list columns - all list columns must have the same # of elements in each row:

In [134]: df
Out[134]:
   aaa  myid        num          text
0   10     1  [1, 2, 3]  [aa, bb, cc]
1   11     2         []            []
2   12     3     [1, 2]      [cc, dd]
3   13     4         []            []

In [135]: explode(df, ['num','text'], fill_value='')
Out[135]:
   aaa  myid num text
0   10     1   1   aa
1   10     1   2   bb
2   10     1   3   cc
3   11     2
4   12     3   1   cc
5   12     3   2   dd
6   13     4

preserving original index values:

In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
Out[136]:
   aaa  myid num text
0   10     1   1   aa
0   10     1   2   bb
0   10     1   3   cc
1   11     2
2   12     3   1   cc
2   12     3   2   dd
3   13     4

Setup:

df = pd.DataFrame({
 'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
 'myid': {0: 1, 1: 2, 2: 3, 3: 4},
 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}
})

CSV column:

In [46]: df
Out[46]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
Out[47]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

using this little trick we can convert CSV-like column to list column:

In [48]: df.assign(var1=df.var1.str.split(','))
Out[48]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

UPDATE: generic vectorized approach (will work also for multiple columns):

Original DF:

In [177]: df
Out[177]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

Solution:

first let's convert CSV strings to lists:

In [178]: lst_col = 'var1' 

In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})

In [180]: x
Out[180]:
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

Now we can do this:

In [181]: pd.DataFrame({
     ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
     ...:     for col in x.columns.difference([lst_col])
     ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
     ...:
Out[181]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

OLD answer:

Inspired by @AFinkelstein's solution, I wanted to make it a bit more generalized, so it can be applied to a DF with more than two columns, and as fast, well, almost as fast, as AFinkelstein's solution:

In [2]: df = pd.DataFrame(
   ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
   ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
   ...: )

In [3]: df
Out[3]:
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [4]: (df.set_index(df.columns.drop('var1').tolist())
   ...:    .var1.str.split(',', expand=True)
   ...:    .stack()
   ...:    .reset_index()
   ...:    .rename(columns={0:'var1'})
   ...:    .loc[:, df.columns]
   ...: )
Out[4]:
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ
MaxU - stand with Ukraine
  • Dude, if you can open a discussion on the pandas GitHub, I think we do need a built-in function like this!!! I have seen so many questions about unlistify and unnesting on SO for pandas – BENY Sep 01 '17 at 16:45
  • How to use this for multiple columns? Like if I have comma-separated data in 2 columns and want to do it in sequence? – Jaskaran Singh Puri Aug 21 '18 at 13:52
  • @JaskaranSinghPuri, you want to convert all CSV columns to lists first. – MaxU - stand with Ukraine Aug 21 '18 at 15:14
  • Unfortunately, it doesn't work if your list elements are tuples. But after converting the entire tuple to a string, it works like a charm! – Guido Nov 28 '18 at 14:12
  • This solution appears to be the fastest one except for the case where there are many very short lists; see https://stackoverflow.com/a/54318064 for benchmarks. – krassowski Jan 23 '19 at 16:53
  • It would be a nice update if you could provide different delimiters for multiple columns, for example `sep=[',', '~']` – Erfan Jul 10 '19 at 16:47
  • @Erfan, you can easily do this: `.var1.str.split(r'[,~]', expand=True)` – MaxU - stand with Ukraine Jul 10 '19 at 18:40
  • Looks like WenBen's plea was heard by the pandas gods: they've added a [`.explode()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas-dataframe-explode) method to the API (also see [this answer](https://stackoverflow.com/a/57122617/4909087)). – cs95 Jul 20 '19 at 07:23
  • @MaxU How can I make it work for different-size list columns? – Vikram Ranabhatt Mar 21 '20 at 16:25
  • @Chris_vr, just call it sequentially ;) – MaxU - stand with Ukraine Mar 21 '20 at 16:26
  • @MaxU At time 1: col1 10.5 10.6 20.4 30.7 40.8 50.3 60.9 70.2 80.1 90.0 and col2 0.9 10.8 20.7 30.6 40.5 50.4 60.3 70.0. One column has 10 values in its list, the 2nd column has 8 values in its list. If I run it sequentially it will make 80 entries, which is wrong; it should have the max of the two columns, i.e. 10 rows, and col2 should have none/empty values for 2 rows. How can I achieve this? – Vikram Ranabhatt Mar 21 '20 at 17:36
  • @MaxU https://stackoverflow.com/questions/60799274/explode-multiple-coumns-with-different-size-list-columns-in-pandas – Vikram Ranabhatt Mar 22 '20 at 12:29
  • Unfortunately it gives me a value error: **ValueError: Length of values does not match length of index** [link](https://stackoverflow.com/questions/61712122/exploding-python-dataframe-for-multiple-columns) @MaxU – Milon Sarker May 10 '20 at 12:54
  • Does anybody know if the pandas team created an explode function? – Corina Roca Jan 13 '21 at 14:23
  • @CorinaRoca, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html ;) – MaxU - stand with Ukraine Jan 13 '21 at 14:45
  • Why not just `df.set_index("var2")` instead of `df.set_index(df.columns.drop('var1').tolist())` in the last code block? – ali bakhtiari Jan 05 '23 at 10:03
145

After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index, then resetting the index and renaming the columns, but I'd imagine there's something else that works.

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1
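
For readers asking in the comments how to keep more than two columns: a hedged sketch of the same stack() idea, building a MultiIndex from every column you want to carry along (the frame and the extra var3 column here are hypothetical, and pandas is assumed imported as pd):

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2], 'var3': ['XX', 'ZZ']})
keep = df.columns.drop('var1').tolist()          # columns to carry along
b = pd.DataFrame(df['var1'].str.split(',').tolist(),
                 index=pd.MultiIndex.from_frame(df[keep])).stack()
b = (b.reset_index()                             # keep-columns become columns again
      .drop(columns='level_2')                   # drop the stack level
      .rename(columns={0: 'var1'})[df.columns.tolist()])
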
DMulligan
  • This solution worked significantly faster and appears to use less memory. – cyril Apr 15 '17 at 00:06
  • This is a nice vectorized pandas solution, I was looking for that. Thanks! – Dennis Golomazov Jan 05 '18 at 19:20
  • When I try this on my own dataset, I keep getting `TypeError: object of type 'float' has no len()` at the very first step (`DataFrame(df.var1.str.split(',').tolist())`) – user5359531 Aug 23 '18 at 22:10
  • @user5359531 your dataset probably has some `NaN` in that column, so the replacement is `b = DataFrame(a.var1.str.split(',').values.tolist(), index=a.var2).stack()` – Flair Oct 01 '18 at 23:00
  • Just FYI, [here's](https://medium.com/@sureshssarda/pandas-splitting-exploding-a-column-into-multiple-rows-b1b1d59ea12e) a nice write-up of this solution with an example. – hhbilly May 20 '19 at 22:58
  • Hi there, this code is good, but what about when we have more than 2 columns? – sygneto Apr 03 '20 at 09:05
  • Did you find a solution for the case where there are more than 2 columns? – rakesh Sep 08 '20 at 10:33
  • A little hard to understand your solution. What happens if I have more columns in my dataframe? – lfvv Nov 20 '20 at 20:10
  • I was having the same issue of having more than 2 columns. Here is the answer that worked for me: https://stackoverflow.com/a/50731258/1108903 – Mario Alberto Barrantes Quesad Jul 22 '21 at 14:41
145

Pandas >= 0.25

Series and DataFrame define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

Since you have a column of comma-separated strings, split each string on the comma to get a list of elements, then call explode on that column.

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 'var2': [1, 2]})
df
    var1  var2
0  a,b,c     1
1  d,e,f     2

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Note that explode only works on a single column (for now). To explode multiple columns at once, see below.

NaNs and empty lists get the treatment they deserve without you having to jump through hoops to get it right.

df = pd.DataFrame({'var1': ['d,e,f', '', np.nan], 'var2': [1, 2, 3]})
df
    var1  var2
0  d,e,f     1
1            2
2    NaN     3

df['var1'].str.split(',')

0    [d, e, f]
1           []
2          NaN

df.assign(var1=df['var1'].str.split(',')).explode('var1')

  var1  var2
0    d     1
0    e     1
0    f     1
1          2  # empty list entry becomes empty string after exploding 
2  NaN     3  # NaN left un-touched

This is a serious advantage over ravel/repeat-based solutions (which ignore empty lists completely and choke on NaNs).


Exploding Multiple Columns

pandas 1.3 update

df.explode works on multiple columns starting from pandas 1.3:

df = pd.DataFrame({'var1': ['a,b,c', 'd,e,f'], 
                   'var2': ['i,j,k', 'l,m,n'], 
                   'var3': [1, 2]})
df
    var1   var2  var3
0  a,b,c  i,j,k     1
1  d,e,f  l,m,n     2

(df.set_index(['var3']) 
       .apply(lambda col: col.str.split(','))
       .explode(['var1', 'var2'])
       .reset_index()
       .reindex(df.columns, axis=1))

  var1 var2  var3
0    a    i     1
1    b    j     1
2    c    k     1
3    d    l     2
4    e    m     2
5    f    n     2

On older versions, you would move the explode call inside the apply, which is a lot less performant:

(df.set_index(['var3']) 
   .apply(lambda col: col.str.split(',').explode())
   .reset_index()
   .reindex(df.columns, axis=1))

The idea is to set as the index all the columns that should NOT be exploded, then explode the remaining columns via apply. This works well when the lists are equally sized.

cs95
  • +1. But when I use a new column, it is not working as expected. Like `df.assign(var3=df['var1'].str.split(',')).explode('var1')`. Can you please help? I replaced `var1` with `var3` – Avinash Sep 01 '20 at 14:10
  • @Avinash the argument to explode should also be var3. This is a basic problem, so spend a few minutes to understand why. – cs95 Sep 01 '20 at 16:42
  • If you have rows with strings and ints you need to .astype(str), otherwise you get NaN values for the ints. – Vega Jan 12 '21 at 13:25
  • I don't know why this simple, readable solution isn't the top-voted answer, and a complex, difficult-to-understand one is. In my case, for this to work all I had to do was assign the returned value like so: `df = df.assign(var1=df['var1'].str.split(',')).explode('var1')`. I also split the process into multiple lines for readability. – Devyzr Apr 14 '21 at 18:17
  • Note: if you want to continue to use this exploded DataFrame then you need to explicitly assign it to itself, i.e. `df = df.explode('var1')`. This does not support in-place function calling. – matt123788 Aug 17 '21 at 16:01
  • Pandas supports multi-column explode [as of version 1.3.0](https://pandas.pydata.org/docs/whatsnew/v1.3.0.html#other-enhancements). – Ben Jul 19 '22 at 17:44
104

How about something like this:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns
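
For example, a minimal sketch of that renaming step (assuming pandas is imported as pd):

b = pd.concat([pd.Series(row['var2'], row['var1'].split(','))
               for _, row in a.iterrows()]).reset_index()
b.columns = ['var1', 'var2']   # the split values were the index, var2 the data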

Chang She
  • Looks like this is going to work. Thanks for your help! In general, though, is there a preferred approach to split-apply-combine where apply returns a dataframe of arbitrary size (but consistent for all chunks), and combine just vstacks the returned DFs? – Vincent Oct 02 '12 at 00:22
  • GroupBy.apply should work (I just tried it against master). However, in this case you don't really need to go through the extra step of grouping, since you're generating the data by row, right? – Chang She Oct 02 '12 at 01:43
  • Hey guys. Sorry to jump into this so late, but wondering if there is not a better solution to this. I'm trying to experiment with iterrows for the first time since that seems like the ticket for this. I'm also confused by the solution proposed. What does the "_" represent? Can you possibly explain how the solution works? – horatio1701d Jun 25 '14 at 20:20
  • Can the solution be extended to more than two columns? – horatio1701d Jun 25 '14 at 21:54
  • Is there an implementation with the new API enhancements that might make this a little more performant? This implementation takes far too long to be practical on my large datasets. – horatio1701d Sep 03 '14 at 18:06
  • Please check [this vectorized approach](http://stackoverflow.com/a/40449726/5741205)... – MaxU - stand with Ukraine Feb 02 '17 at 22:30
  • @horatio1701d, _ is a placeholder meaning that you will not take the output into account. Here the output is the unpacking of iterrows() for each line, hence both the index and the content of the line; you ignore the index and use the content of the line. The solution works by using a trick: the repeated content is actually passed as "data", and the second parameter is a list that becomes the index of the Series. I tried it out because I was a little puzzled too, but this is clever, though not so efficient computationally. – Ando Jurai May 30 '17 at 11:43
  • This approach seems to be very memory-efficient, especially for big lists; see https://stackoverflow.com/a/54318064 for benchmarks. – krassowski Jan 23 '19 at 16:55
  • Why does this sometimes work (on some dataframes) but does not seem to work on others? I have had it work on one dataset, but when trying on another now, I get "NameError: name 'Series' is not defined" – bernando_vialli Jul 17 '19 at 17:01
  • @bernando_vialli it depends on how one has imported Series and DataFrame from pandas: `Series` will work with `from pandas import DataFrame, Series`, whereas the more popular `import pandas as pd` would need `pd.Series`. – Gaurav Taneja Jun 25 '21 at 14:51
55

Here's a function I wrote for this common task. It's more efficient than the Series/stack methods. Column order and names are retained.

def tidy_split(df, column, sep='|', keep=False):
    """
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    Params
    ------
    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as its own row

    Returns
    -------
    pandas.DataFrame
        Returns a dataframe with the same columns as `df`.
    """
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

With this function, the original question is as simple as:

tidy_split(a, 'var1', sep=',')
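
And a hypothetical usage of the keep flag, which also retains each pre-split value as a row of its own:

tidy_split(a, 'var1', sep=',', keep=True)
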
Daniel Himmelstein
19

Similar question as: pandas: How do I split text in a column into multiple rows?

You could do:

>>> a = pd.DataFrame({"var1": "a,b,c d,e,f".split(), "var2": [1, 2]})
>>> s = a.var1.str.split(",").apply(pd.Series).stack()
>>> s.index = s.index.droplevel(-1)  # flatten the inner split level
>>> s.name = 'var1'                  # the Series needs a name to join
>>> del a['var1']
>>> a.join(s)
   var2 var1
0     1    a
0     1    b
0     1    c
1     2    d
1     2    e
1     2    f
inodb
19

It is possible to split and explode the dataframe without changing its structure.

Split and expand data of specific columns

Input:

    var1    var2
0   a,b,c   1
1   d,e,f   2



# Split the CSV strings into lists, then explode one entry per row
df['var1'] = df['var1'].str.split(',')
df = df.explode('var1')

Out:

    var1    var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2

Edit-1

Split and expand rows for multiple columns

Filename    RGB                                             RGB_type
0   A   [[0, 1650, 6, 39], [0, 1691, 1, 59], [50, 1402...   [r, g, b]
1   B   [[0, 1423, 16, 38], [0, 1445, 16, 46], [0, 141...   [r, g, b]

Re-index based on the reference column and align the column values with stack:

df = df.reindex(df.index.repeat(df['RGB_type'].apply(len)))
df = df.groupby('Filename').apply(lambda x:x.apply(lambda y: pd.Series(y.iloc[0])))
df.reset_index(drop=True).ffill()

Out:

  Filename RGB_type  Top 1 colour  Top 1 frequency  Top 2 colour  Top 2 frequency
0        A        r             0             1650             6               39
1        A        g             0             1691             1               59
2        A        b            50             1402            49              187
3        B        r             0             1423            16               38
4        B        g             0             1445            16               46
5        B        b             0             1419            16               39
Naga kiran
18

TL;DR

import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

Demonstration

explode_str(a, 'var1', ',')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Let's create a new dataframe d that has lists

d = a.assign(var1=lambda d: d.var1.str.split(','))

explode_list(d, 'var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

General Comments

I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.
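
A tiny illustration of what that positions array looks like (a sketch, assuming numpy as np and pandas as pd):

s = pd.Series(['a,b,c', 'd,e'])
i = np.arange(len(s)).repeat(s.str.count(',') + 1)
print(i)  # [0 0 0 1 1] -- row 0 repeated 3 times, row 1 twice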

FAQ

Why don't I use loc?

Because the index may not be unique and using loc will return every row that matches a queried index.

Why don't you use the values attribute and slice that?

When calling values, if the entirety of the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype object. By using iloc instead of slicing the values attribute, I save myself from having to deal with that.
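
A quick demonstration of that dtype point (a sketch with a made-up frame):

df = pd.DataFrame({'var1': ['a', 'b'], 'var2': [1, 2]})
print(df.values.dtype)            # object -- mixed columns are cobbled into one array
print(df[['var2']].values.dtype)  # int64 -- a single homogeneous block keeps its dtype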

Why do you use assign?

When I use assign using the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.

Why are the index values repeated?

By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern: one repeat for each element in the list or string.
This can be reset with reset_index(drop=True)


For Strings

I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument, assuming that if I were to split, the length of the resulting list would be one more than the number of separators.

I then use that sep to join the strings, then split the result.

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

For Lists

Similar to the approach for strings, except I don't need to count occurrences of sep because it's already split.

I use Numpy's concatenate to jam the lists together.

import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

piRSquared
  • I like this one. Really concise and the performance should be really good too. One question though: is df.iloc[i] the same as repeating rows of the dataframe or is it more efficient than that? Thanks! – Tim Mar 26 '19 at 21:51
8

I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).

import pandas

def splitDataFrameList(df, target_column, separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df
jlln
4

Here is a fairly straightforward method that uses the split method from the pandas str accessor and then uses NumPy to flatten each row into a single array.

The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.

var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) // len(df))

pd.DataFrame({'var1': var1,
              'var2': var2})

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2
Ted Petrou
4

I have been struggling with out-of-memory errors using various ways to explode my lists, so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of the list length to the number of lists. Sharing the results below:
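
For orientation, a rough sketch of how such test frames might be generated (this only loosely mirrors the gist's setup; the names and sizes are made up):

import numpy as np
import pandas as pd

def make_df(n_rows, list_len):
    # n_rows rows, each holding a list of list_len letters in 'var1'
    return pd.DataFrame({
        'var1': [list(np.random.choice(list('abcdef'), list_len)) for _ in range(n_rows)],
        'var2': np.arange(n_rows),
    })

df_many_short = make_df(n_rows=100_000, list_len=3)  # many rows, short lists
df_few_long = make_df(n_rows=10, list_len=30_000)    # few rows, very long lists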

Time (less is better): [benchmark chart]

Peak memory usage (less is better): [benchmark chart]

Conclusions:

  • @MaxU's answer (update 2), codename "concatenate", offers the best speed in almost every case while keeping peak memory usage low,
  • see @DMulligan's answer (codename "stack") if you need to process lots of rows with relatively small lists and can afford increased peak memory,
  • the accepted answer by @Chang She works well for data frames that have a few rows but very large lists.

Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include the splitting of strings into lists, which most solutions performed in a similar fashion.

krassowski
4

One-liner using split(___, expand=True) and the level and name arguments to reset_index():

>>> b = a.var1.str.split(',', expand=True).set_index(a.var2).stack().reset_index(level=0, name='var1')
>>> b
   var2 var1
0     1    a
1     1    b
2     1    c
0     2    d
1     2    e
2     2    f

If you need b to look exactly like in the question, you can additionally do:

>>> b = b.reset_index(drop=True)[['var1', 'var2']]
>>> b
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2
1''
2

Based on @DMulligan's excellent solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows and merges it back into the original dataframe. It also uses the great generic change_column_order function from this answer.

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

Example:

df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
df
    Name     A  B
10  a:b      1  4
12  c:d      2  5
13  e:f:g:h  3  6

split_df(df, 'Name', ':')
    Name  A  B
10  a     1  4
10  b     1  4
12  c     2  5
12  d     2  5
13  e     3  6
13  f     3  6
13  g     3  6
13  h     3  6

Note that it preserves the original index and order of the columns. It also works with dataframes which have non-sequential index.

Dennis Golomazov
  • This cracked this one for me, nice work: https://stackoverflow.com/a/48554655/6672746 – Evan Feb 01 '18 at 03:07
2

The string method split can take an optional Boolean argument expand.

Here is a solution using this argument:

(a.var1
  .str.split(",",expand=True)
  .set_index(a.var2)
  .stack()
  .reset_index(level=1, drop=True)
  .reset_index()
  .rename(columns={0:"var1"}))
cgels
2

I really do appreciate the answer of "Chang She", but the iterrows() function takes a long time on large datasets. I faced that issue, and I came up with this.

# First, reset_index to make the index a column
a = a.reset_index().rename(columns={'index':'duplicated_idx'})

# Get a longer series with exploded cells to rows
series = pd.DataFrame(a['var1'].str.split(',')
                      .tolist(), index=a.duplicated_idx).stack()

# New df from series and merge with the old one
b = series.reset_index([0, 'duplicated_idx'])
b = b.rename(columns={0:'var1'})

# Optional & Advanced: In case, there are other columns apart from var1 & var2
b.merge(
    a[a.columns.difference(['var1'])],
    on='duplicated_idx')

# Optional: Delete the "duplicated_index"'s column, and reorder columns
b = b[a.columns.difference(['duplicated_idx'])]
2

One-liner using assign and explode:

Input:

    col1  col2
0  a,b,c     1
1  d,e,f     2

df.assign(col1 = df.col1.str.split(',')).explode('col1', ignore_index=True)

Output:

  col1  col2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2
Mykola Zotko
1

Just used jlln's excellent answer from above, but needed to expand it to split multiple columns. Thought I would share.

import pandas as pd

def splitDataFrameList(df, target_columns, separator):
    ''' df = dataframe to split,
    target_columns = the columns containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target columns separated,
    with each element moved into a new row.
    The values in the other columns are duplicated across the newly divided rows.
    '''
    def splitListToRows(row, row_accumulator, target_columns, separator):
        split_rows = []
        for target_column in target_columns:
            split_rows.append(row[target_column].split(separator))
        # Separate for multiple columns
        for i in range(len(split_rows[0])):
            new_row = row.to_dict()
            for j in range(len(split_rows)):
                new_row[target_columns[j]] = split_rows[j][i]
            row_accumulator.append(new_row)
    new_rows = []
    df.apply(splitListToRows, axis=1, args=(new_rows, target_columns, separator))
    new_df = pd.DataFrame(new_rows)
    return new_df
1

Upgraded MaxU's answer with MultiIndex support:

def explode(df, lst_cols, fill_value='', preserve_index=False):
    """
    usage:
        In [134]: df
        Out[134]:
           aaa  myid        num          text
        0   10     1  [1, 2, 3]  [aa, bb, cc]
        1   11     2         []            []
        2   12     3     [1, 2]      [cc, dd]
        3   13     4         []            []

        In [135]: explode(df, ['num','text'], fill_value='')
        Out[135]:
           aaa  myid num text
        0   10     1   1   aa
        1   10     1   2   bb
        2   10     1   3   cc
        3   11     2
        4   12     3   1   cc
        5   12     3   2   dd
        6   13     4
    """
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
                index=idx)
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty; DataFrame.append was removed
        # in pandas 2.0, so use pd.concat instead
        res = (pd.concat([res, df.loc[lens == 0, idx_cols]], sort=False)
                  .fillna(fill_value))
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:
        res = res.reset_index(drop=True)
    # if the original index is a MultiIndex, rebuild it from the tuples
    # (only meaningful when the original index values were preserved)
    elif isinstance(df.index, pd.MultiIndex):
        res.index = pd.MultiIndex.from_tuples(res.index, names=df.index.names)
    return res
1

My version of the solution to add to this collection! :-)

# Original problem
from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])
### My solution
import pandas as pd
import functools
def expand_on_cols(df, fuse_cols, delim=","):
    def expand_on_col(df, fuse_col):
        col_order = df.columns
        df_expanded = pd.DataFrame(
            df.set_index([x for x in df.columns if x != fuse_col])[fuse_col]
            .apply(lambda x: x.split(delim))
            .explode()
        ).reset_index()
        return df_expanded[col_order]
    all_expanded = functools.reduce(expand_on_col, fuse_cols, df)
    return all_expanded

assert(b.equals(expand_on_cols(a, ["var1"], delim=",")))
darkhipo
1

A short and simple way to change the format of the column using .apply() so that it can be used by .explode():

import pandas as pd
from io import StringIO

file = StringIO("""    var1  var2
0  a,b,c     1
1  d,e,f     2""")

df = pd.read_csv(file, sep=r'\s\s+')

df['var1'] = df['var1'].apply(lambda x : str(x).split(','))

df.explode('var1')

Output:

  var1  var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2
ali bakhtiari
0

I have come up with the following solution to this problem:

def iter_var1(d):
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = DataFrame.from_records([i for i in iter_var1(a)],
        columns=["var1", "var2"])
Pavel
0

Another solution, using Python's copy module:

import copy
import pandas as pd

def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
                new_observations.append(new_observation) 
        else:
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
            new_observations.append(new_observation) 
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)
0

There are a lot of answers here, but I'm surprised no one has mentioned the built-in pandas explode function. Check out the link below: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.explode.html#pandas.DataFrame.explode

For some reason I was unable to access that function, so I used the below code:

import pandas_explode
pandas_explode.patch()
df_zlp_people_cnt3 = df_zlp_people_cnt2.explode('people')

[screenshot of sample data showing a 'people' column containing lists of people]

Above is a sample of my data. As you can see, the people column held a series of people, and I was trying to explode it. The code I have given works for list-type data, so try to get your comma-separated text data into list format. Also, since my code uses built-in functions, it is much faster than custom/apply functions.
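
A sketch of that conversion step (frame and column names taken from the snippet above):

# turn comma-separated text into lists so the explode step can work on them
df_zlp_people_cnt2['people'] = df_zlp_people_cnt2['people'].str.split(',')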

Note: You may need to install pandas_explode with pip.

Harsha Reddy
0

I had a similar problem. My solution was converting the dataframe to a list of dictionaries first, then doing the transition. Here is the function:

import re
import pandas as pd

def separate_row(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for word in re.split(',', row_dict[column_name]):
            row = row_dict.copy()
            row[column_name]=word
            ls.append(row)
    return pd.DataFrame(ls)

Example:

>>> from pandas import DataFrame
>>> import numpy as np
>>> a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
>>> a
    var1  var2
0  a,b,c     1
1  d,e,f     2
>>> separate_row(a, "var1")
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

You can also change the function a bit to support separating list type rows.
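
For instance, a hedged sketch of that list variant (assuming the cells already hold lists):

def separate_row_lists(df, column_name):
    ls = []
    for row_dict in df.to_dict('records'):
        for item in row_dict[column_name]:  # iterate the list directly, no split
            row = row_dict.copy()
            row[column_name] = item
            ls.append(row)
    return pd.DataFrame(ls)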

Zhiwei
0

After combining bits and pieces from the solutions on this page, I ended up with the following (for anyone who needs to use it right away). The parameters are df (the input dataframe) and key (the column containing the delimiter-separated strings). Just replace the semicolon ";" with your delimiter if it differs.

def split_df_rows_for_semicolon_separated_key(key, df):
    df = (df.set_index(df.columns.drop(key).tolist())[key]
            .str.split(';', expand=True)
            .stack()
            .reset_index()
            .rename(columns={0: key})
            .loc[:, df.columns])
    df = df[df[key] != '']
    return df
Kranti
0

Try:

vals = np.array(a.var1.str.split(",").values.tolist())
var = np.repeat(a.var2, vals.shape[1])

out = pd.DataFrame(np.column_stack((vals.ravel(), var)), columns=a.columns)
display(out)

  var1 var2
0    a    1
1    b    1
2    c    1
3    d    2
4    e    2
5    f    2

Hamza usman ghani
0

In recent versions of pandas you can use split followed by explode:

a.assign(var1=a['var1'].str.split(',')).explode('var1')

   var1 var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2