0

I am trying to split multiple columns into multiple rows in few lines rather than writing a big 'def function' for it. I have 2 columns that need splitting by ;. I am trying to follow this: Split cell into multiple rows in pandas dataframe

from pandas import DataFrame
import numpy as np
from itertools import chain
import pandas as pd

a = DataFrame([{'var1': 'a;b;c', 'var2': 1,'var3':'apples;mango'},
            {'var1': 'd;e;f', 'var2': 2,'var3':'kiwi;pineapple'},
            {'var1': 'g;e;a', 'var2': 15,'var3':'pinneapple'},
            {'var1': 'm', 'var2': 12,'var3':'orange'}])
a

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(';')))

# calculate lengths of splits
lens = a['var1'].str.split(';').map(len)

# create new dataframe, repeating or chaining as appropriate
new_df = pd.DataFrame({'var1': chainer(a['var1']),
                    'var2': np.repeat(a['var2'], lens),
                    'var3': chainer(a['var3'])
                })

print(new_df)

#ERROR: ValueError: arrays must all be same length

Looking for Output like this: enter image description here

sharp
  • 2,140
  • 9
  • 43
  • 80
  • the main difference you have with the solution of [jpp](https://stackoverflow.com/a/50731258/9274732) you seem to try is that you want the Cartesian product of the values in both `var1` and `var3` (for example the first row has 3*2 rows in the output) while the solution is written when any value in `package` is associated to one value in `package code`. – Ben.T Dec 13 '18 at 20:12

3 Answers3

1

A product is more appropriated here :

import itertools

def splitter(values):
    return list(itertools.product( *[str(v).split(';') for v in values]))

In [6]: splitter(df.iloc[2]) #example
Out[6]: 
[('g', '15', 'pinneapple'),
 ('e', '15', 'pinneapple'),
 ('a', '15', 'pinneapple')]

Now, do it for all rows and rebuild a fresh dataframe:

def expand(df):
    tuples=list()
    for i,row in df.iterrows():
        tuples.extend(splitter(row))
    return  pd.DataFrame.from_records(tuples,columns=df.columns)

Result :

In [7]: expand(df)
   var  var2        var3
0     a    1      apples
1     a    1       mango
2     b    1      apples
3     b    1       mango
4     c    1      apples
5     c    1       mango
6     d    2        kiwi
7     d    2   pineapple
8     e    2        kiwi
9     e    2   pineapple
10    f    2        kiwi
11    f    2   pineapple
12    g   15  pinneapple
13    e   15  pinneapple
14    a   15  pinneapple
15    m   12      orange
B. M.
  • 18,243
  • 2
  • 35
  • 54
0

Try the following code:

import pandas as pd

# Source data
a = pd.DataFrame([{'var1': 'a;b;c', 'var2': 1,'var3':'apples;mango'},
    {'var1': 'd;e;f', 'var2':  2,'var3':'kiwi;pineapple'},
    {'var1': 'g;e;a', 'var2': 15,'var3':'pinneapple'},
    {'var1': 'm',     'var2': 12,'var3':'orange'}])
# Split var1
a2 = a.var1.apply(lambda t: pd.Series(t.split(';')))\
    .merge(a, right_index = True, left_index = True)\
    .drop(['var1'], axis = 1)\
    .melt(id_vars = ['var2', 'var3'], value_name = 'var1')\
    .drop('variable', axis = 1).dropna()
# Split var3
a3 = a2.var3.apply(lambda t: pd.Series(t.split(';')))\
    .merge(a2, right_index = True, left_index = True)\
    .drop(['var3'], axis = 1)\
    .melt(id_vars = ['var1', 'var2'], value_name = 'var3')\
    .drop('variable', axis = 1).dropna()
# Sort the result
a3.sort_values(['var2', 'var3'])

a2 contains a table with var1 divided into separate rows.

The same operation for var3 performs the next instruction (similar to the previous with changed names.

The last step is to sort the result.

To understand how this code works, perform separately each step of one of the chained instructions.

If you have more columns to be divided, add analogous "splitting" instructions for them.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
0

You can join each column once you have use str.split and stack on each column. This can be done by first defining a function to get the column in the proper shape:

def split_stack_col (a_col):
    return (a_col.astype(str).str.split(';',expand=True).stack()
                 .reset_index(level=1,name=a_col.name)[[a_col.name]])

Then you map this function to each column before using reduce with join. Add a reset_index if needed

from functools import reduce
new_df = reduce(lambda x,y: x.join(y), 
                map( split_stack_col, 
                     (a[col] for col in a.columns))).reset_index(drop=True)

This method can be used for any number of column in a, and you get as expected:

print (new_df)
   var1 var2        var3
0     a    1      apples
1     a    1       mango
2     b    1      apples
3     b    1       mango
4     c    1      apples
5     c    1       mango
6     d    2        kiwi
7     d    2   pineapple
8     e    2        kiwi
9     e    2   pineapple
10    f    2        kiwi
11    f    2   pineapple
12    g   15  pinneapple
13    e   15  pinneapple
14    a   15  pinneapple
15    m   12      orange
Ben.T
  • 29,160
  • 6
  • 32
  • 54