57

I have a dataframe containing orders data; each order has multiple packages stored as comma-separated strings in the [package & package_code] columns.

I want to split the packages data and create a row for each package, including its order details.

Here is a sample input dataframe:

import pandas as pd
df = pd.DataFrame({"order_id":[1,3,7],"order_date":["20/5/2018","22/5/2018","23/5/2018"], "package":["p1,p2,p3","p4","p5,p6"],"package_code":["#111,#222,#333","#444","#555,#666"]})

And this is what I am trying to achieve as output: one row per package, with the order details (order_id, order_date) repeated for each package.

How can I do that with pandas?

Mad Physicist
Nobel

5 Answers

100

pandas >= 0.25

Assuming all splittable columns have the same number of comma-separated items, you can split on comma and then use Series.explode on each column:

(df.set_index(['order_id', 'order_date'])
   .apply(lambda x: x.str.split(',').explode())
   .reset_index())                                                   

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666

Details

Set the columns not to be touched as the index,

df.set_index(['order_id', 'order_date'])

                      package    package_code
order_id order_date                          
1        20/5/2018   p1,p2,p3  #111,#222,#333
3        22/5/2018         p4            #444
7        23/5/2018      p5,p6       #555,#666

The next step is a 2-step process: Split on comma to get a column of lists, then call explode to explode the list values into their own rows.

_.apply(lambda x: x.str.split(',').explode())

                    package package_code
order_id order_date                     
1        20/5/2018       p1         #111
         20/5/2018       p2         #222
         20/5/2018       p3         #333
3        22/5/2018       p4         #444
7        23/5/2018       p5         #555
         23/5/2018       p6         #666

Finally, reset the index.

_.reset_index()

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
1         1  20/5/2018      p2         #222
2         1  20/5/2018      p3         #333
3         3  22/5/2018      p4         #444
4         7  23/5/2018      p5         #555
5         7  23/5/2018      p6         #666

pandas <= 0.24

This should work for any number of columns like this. The essence is a little stack-unstacking magic with str.split.

(df.set_index(['order_date', 'order_id'])
   .stack()
   .str.split(',', expand=True)
   .stack()
   .unstack(-2)
   .reset_index(-1, drop=True)
   .reset_index()
)

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666

There is another performant alternative involving chain (sketched below), but you'd need to explicitly chain and repeat every column (a bit of a problem with a lot of columns). Choose whichever fits the description of your problem best, as there's no single answer.
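
Roughly, that idea looks like this (a sketch only; the numpy.repeat answer further down develops it in full, and it assumes every splittable column has the same number of items per row):

import numpy as np
from itertools import chain

# how many packages each order has
lens = df['package'].str.split(',').str.len()

# repeat the untouched columns, flatten (chain) the split ones
pd.DataFrame({
    'order_id': np.repeat(df['order_id'].to_numpy(), lens),
    'order_date': np.repeat(df['order_date'].to_numpy(), lens),
    'package': list(chain.from_iterable(df['package'].str.split(','))),
    'package_code': list(chain.from_iterable(df['package_code'].str.split(','))),
})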

Details

First, set the columns that are not to be touched as the index.

df.set_index(['order_date', 'order_id'])
 
                      package    package_code
order_date order_id                          
20/5/2018  1         p1,p2,p3  #111,#222,#333
22/5/2018  3               p4            #444
23/5/2018  7            p5,p6       #555,#666

Next, stack the rows.

_.stack()

order_date  order_id              
20/5/2018   1         package               p1,p2,p3
                      package_code    #111,#222,#333
22/5/2018   3         package                     p4
                      package_code              #444
23/5/2018   7         package                  p5,p6
                      package_code         #555,#666
dtype: object

We have a series now. So call str.split on comma.

_.str.split(',', expand=True)

                                     0     1     2
order_date order_id                               
20/5/2018  1        package         p1    p2    p3
                    package_code  #111  #222  #333
22/5/2018  3        package         p4  None  None
                    package_code  #444  None  None
23/5/2018  7        package         p5    p6  None
                    package_code  #555  #666  None

We need to get rid of NULL values, so call stack again.

_.stack()

order_date  order_id                 
20/5/2018   1         package       0      p1
                                    1      p2
                                    2      p3
                      package_code  0    #111
                                    1    #222
                                    2    #333
22/5/2018   3         package       0      p4
                      package_code  0    #444
23/5/2018   7         package       0      p5
                                    1      p6
                      package_code  0    #555
                                    1    #666
dtype: object

We're almost there. Now we want the second-to-last level of the index to become our columns, so unstack using unstack(-2) (i.e., unstack on the second-to-last level).

_.unstack(-2)

                      package package_code
order_date order_id                       
20/5/2018  1        0      p1         #111
                    1      p2         #222
                    2      p3         #333
22/5/2018  3        0      p4         #444
23/5/2018  7        0      p5         #555
                    1      p6         #666

Get rid of the superfluous last level using reset_index:

_.reset_index(-1, drop=True)

                    package package_code
order_date order_id                     
20/5/2018  1             p1         #111
           1             p2         #222
           1             p3         #333
22/5/2018  3             p4         #444
23/5/2018  7             p5         #555
           7             p6         #666

And finally,

_.reset_index()

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
3  22/5/2018         3      p4         #444
4  23/5/2018         7      p5         #555
5  23/5/2018         7      p6         #666
cs95
  • can you add details – Pyd Jun 11 '18 at 05:52
  • @pyd what specifically do you want to know about this? – cs95 Jun 11 '18 at 05:53
  • what is happening in the unstacking with `-2` and the reset_index with `-1`? – Pyd Jun 11 '18 at 05:58
  • what if set_index() throws "Index contains duplicate entries, cannot reshape"? Is there any way around it? – Moj May 07 '19 at 12:27
  • @Moj the idea is that the index being set should be unique (like a primary key). One idea would be to add a 3rd index level created using groupby and cumcount (not at my machine, so cannot test it; a rough sketch appears after these comments). – cs95 May 07 '19 at 15:59
  • So I have a dataframe with 196 columns, one column has comma separated values. I want to split that comma separated value into different rows and duplicate the values in the other 195 columns to each of these new rows. How can I achieve that without having to specify each column name in set_index – Adarsh Ravi May 15 '19 at 14:58
  • @AdarshRavi You can do something like this: `keep_columns = list(set(df.columns) - set([reshape_columns]))` – xApple May 29 '19 at 15:04
  • I updated pandas (version **1.0.3**) and the newest form doesn't seem to work anymore: `ValueError: cannot handle a non-unique multi-index!`. – cglacet Apr 26 '20 at 20:34
  • @cglacet I just tested the example with pandas 1.1.4 and it runs through. But when I did the split with " " instead of "," I got the same error message. The problem was that I used a pandas DataFrame imported from a CSV file, and some columns had extra spaces before or after the list entries, so the number of resulting list entries in the two columns after the split differed. Since I produce the export/import file myself, I now create a pickle instead of a CSV file, and the explode works as intended. – Thomas R Mar 08 '21 at 16:45
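
A rough, untested sketch of that cumcount idea for data with duplicated (order_date, order_id) pairs: add a counter level so the index being reshaped is unique, then drop it at the end (the column name dup is arbitrary):

# make duplicated (order_date, order_id) rows distinguishable
tmp = df.assign(dup=df.groupby(['order_date', 'order_id']).cumcount())

(tmp.set_index(['order_date', 'order_id', 'dup'])
    .stack()
    .str.split(',', expand=True)
    .stack()
    .unstack(-2)
    .reset_index(-1, drop=True)
    .reset_index()
    .drop(columns='dup'))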
28

Here's one way using numpy.repeat and itertools.chain. Conceptually, this is exactly what you want to do: repeat some values, chain others. Recommended for small numbers of columns; otherwise, stack-based methods may fare better.

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = df['package'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'order_id': np.repeat(df['order_id'], lens),
                    'order_date': np.repeat(df['order_date'], lens),
                    'package': chainer(df['package']),
                    'package_code': chainer(df['package_code'])})

print(res)

   order_id order_date package package_code
0         1  20/5/2018      p1         #111
0         1  20/5/2018      p2         #222
0         1  20/5/2018      p3         #333
1         3  22/5/2018      p4         #444
2         7  23/5/2018      p5         #555
2         7  23/5/2018      p6         #666
jpp
  • what if there are two columns that need to be expanded and have different lens? – Moj May 07 '19 at 12:20
  • @Moj, That problem isn't well defined. You can't, for example, align 3 values with 5 values with a 1-to-1 mapping. I suggest you ask a new question specifying precisely your desired output, if your question hasn't been answered elsewhere. – jpp May 07 '19 at 13:48
18

Have a look at today's pandas 0.25 release: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows

df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1}, {'var1': 'd,e,f', 'var2': 2}])
df_splitted = df.assign(var1=df.var1.str.split(',')).explode('var1').reset_index(drop=True)
print(df_splitted)
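
For reference, that should print:

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2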
kush
Heraknos
  • This is neat. It would be nice to know (a) how Pandas has implemented this function internally, (b) how this solution compares with others in terms of performance. – jpp Sep 17 '19 at 19:47
  • Only one column can be exploded this way – cs95 Sep 24 '19 at 14:50
  • @cs95 No, it's possible to explode on multiple columns by adapting like this : ```df = pd.DataFrame([{'var1': 'a,b,c','var3': 'x1,x2,x3', 'var2': 1}, {'var1': 'd,e,f','var3': 'x1,x2,x4', 'var2': 2}]) df.assign(var1=df.var1.str.split(','), var3=df.var3.str.split(',')).explode('var1').explode('var3').reset_index(drop=True)``` – Heraknos Sep 25 '19 at 08:29
  • The result is incorrect: each column is exploded separately rather than in tandem with the other, producing more rows in the result than desired. – cs95 Sep 25 '19 at 15:29
  • I don't understand, can you give an example of the data and the desired result? – Heraknos Sep 26 '19 at 15:50
  • Just run your own code and see. Anyway, if you want to do this for multiple columns, you can use apply. See my edited answer. – cs95 Nov 11 '19 at 22:28
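
Following up on this thread: newer pandas (1.3 and later, as far as I recall) lets DataFrame.explode take a list of columns, so several columns can be exploded in tandem, provided the per-row list lengths match. Using the two-list-column frame from the comment above:

# split both columns, then explode them together so the values stay aligned
(df.assign(var1=df.var1.str.split(','),
           var3=df.var3.str.split(','))
   .explode(['var1', 'var3'])
   .reset_index(drop=True))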
7

Given that explode only affects list columns anyway, a simple solution is:

# Convert the columns of interest to list columns
df["package"]      = df["package"].str.split(",")
df["package_code"] = df["package_code"].str.split(",")

# Explode the entire data frame
df = df.apply(pd.Series.explode)

Advantages:

  • Avoids having to move the core data to an index to "keep it out of the way", and therefore doesn't fail with a "duplicate index" error when the data contains repeats.

Disadvantages:

  • Only works if there are no list columns already in the data (although this is almost always the case).
c z
  • This should be the accepted answer. Using `set_index` also re-orders the columns, if any of the "index" columns happen to come after the columns you want to explode. – James McKinney Apr 25 '22 at 23:53
5

Close to @coldspeed's method :-)

(df.set_index(['order_date', 'order_id'])
   .apply(lambda x: x.str.split(','))
   .stack()
   .apply(pd.Series)
   .stack()
   .unstack(level=2)
   .reset_index(level=[0, 1]))

  order_date  order_id package package_code
0  20/5/2018         1      p1         #111
1  20/5/2018         1      p2         #222
2  20/5/2018         1      p3         #333
0  22/5/2018         3      p4         #444
0  23/5/2018         7      p5         #555
1  23/5/2018         7      p6         #666
BENY
  • @coldspeed yep, I am going to highlight it again in my mind (apply is bad! :-() – BENY Jun 07 '18 at 01:03
  • If `apply(pd.Series)` is bad, isn't `stack` worse? I've seen it often underperform simple iteration! – jpp Jun 07 '18 at 09:03
  • @jpp yeah it was general advice to Wen (he gets it) because he uses it a bit more than he should, and you'd expect a 60k user with a gold badge to want to propagate good practices where possible ;-) sure stack sucks, but its flexibility here is useful. – cs95 Jun 07 '18 at 12:57