Splitting a cell in pandas into multiple rows

Question

It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell has 5 to 6 non unique string values in it that I need to separate into different rows.

How can we split the original df to the transformed df?

Original df

import pandas as pd
df = pd.DataFrame({"Privileges":['Type1','Type2','Type3'],"Super_Admin":["A1,A2,A3,A4","A1,B1,B2,B3, A4","C1,B2,C2,C3"], "Admin":["A1,A2","A1,B1,B2","B2, C1,C2"])

Index	Privileges	Super_Admin	Admin
0	Type1	A1,A2,A3,A4	A1,A2
1	Type2	A1,B1,B2,B3, A4	A1,B1,B2
2	Type3	C1,B2,C2,C3	B2, C1,C2

Transformed df

df = pd.DataFrame({"Privileges":['Type1','Type1','Type1','Type1','Type2','Type2','Type2','Type2','Type2','Type3','Type3','Type3','Type3'],"Super_Admin":["A1","A2","A3","A4","A1", "B1","B2","B3", "A4","C1","B2","C2","C3"], "Admin":["A1","A2",'', '',"A1","B1","B2",'', '', "B2", "C1","C2", '']})

Index	Privileges	Super_Admin	Admin
0	Type1	A1	A1
1	Type1	A2	A2
2	Type1	A3	--
3	Type1	A4	--
4	Type2	A1	A1
5	Type2	B1	B1
6	Type2	B2	B2
7	Type2	B3	--
8	Type2	A4	--
9	Type3	C1	B2
10	Type3	B2	C1
11	Type3	C2	C2
12	Type3	C3	--

Thank You for sharing the dataframe code as well. Most people just share the picture or text dataframe. You have the full code. It helps us to quickly work on the solution. — Joe Ferndz, Mar 30 '21 at 06:55
Should Type 3 have Super_Admin and Admin match the values. For ex: it has C1 B2 and B2 C1. Should it be C1 C1 and B2 B2 instead? — Joe Ferndz, Mar 30 '21 at 07:52
See the post from a few years ago on [exploding multiple columns...](https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe) — Joe Ferndz, Mar 30 '21 at 07:54

sammywemmy · Answer 1 · 2021-03-30T08:12:22.740

Breakdown of the steps below:

zip all the columns

Use zip_longest to pair None with values that do not have a pair

Combine the two lists into one, with chain

Create dataframe and forward fill on the Privileges column

In [541]: step1 = zip(df.Privileges, df.Super_Admin, df.Admin)

In [542]: step2 = (zip_longest([first], second,last) 
                   for first, second, last 
                   in step1)

In [543]: step3 = chain.from_iterable(step2)

In [546]: (pd.DataFrame(step3, columns = df.columns)
             .assign(Privileges = lambda df: df.Privileges.ffill())
           )
Out[546]: 
   Privileges Super_Admin Admin
0       Type1          A1    A1
1       Type1          A2    A2
2       Type1          A3  None
3       Type1          A4  None
4       Type2          A1    A1
5       Type2          B1    B1
6       Type2          B2    B2
7       Type2          B3  None
8       Type2          A4  None
9       Type3          C1    B2
10      Type3          B2    C1
11      Type3          C2    C2
12      Type3          C3  None

To get more speed, you can move the split step into native python territory. pandas string methods are wrappers around python's string functions, as such they are not as fast as python's string functions.

Too fast :) I just got started and you already have it. Mine would have been too complex. This looks neat. upvoted. — Joe Ferndz, Mar 30 '21 at 06:56

score 2 · Answer 2 · answered Mar 30 '21 at 06:58

This is on option. First make Super_Admin and Admin to a list. This is useful to use pd.explode().

df['Super_Admin'] = df['Super_Admin'].apply(lambda x: x.split(','))
df['Admin'] = df['Admin'].apply(lambda x: x.split(','))

Then call explode on both columns and fill the missing values with an empty string.

a = df.explode('Super_Admin')
b = df.explode('Admin')
for i in range(3):
    short = b.loc[i,'Admin'].values
    long = a.loc[i,'Admin'].values
    a.loc[i,'Admin'] = np.concatenate((short, ['']*(len(long)-len(short))), axis=0)

The output looks like this:

>>> a
  Privileges Super_Admin Admin
0      Type1          A1    A1
0      Type1          A2    A2
0      Type1          A3      
0      Type1          A4      
1      Type2          A1    A1
1      Type2          B1    B1
1      Type2          B2    B2
1      Type2          B3      
1      Type2          A4      
2      Type3          C1    B2
2      Type3          B2    C1
2      Type3          C2    C2
2      Type3          C3

Valdi_Bo · Answer 3 · 2021-03-30T08:36:15.867

You can do the job with a single although substantially chained instruction:

result = df.set_index('Privileges').apply(lambda col:
    col.str.split(', ?', expand=True).stack())\
    .droplevel(1).reset_index().fillna('')

Steps:

df.set_index('Privileges') - Set Privileges as the index column.
apply(lambda col: - Apply to each column:
- col.str.split(', ?', expand=True) - Break this column into separate columns (a DataFrame, with column names as consecutive integers).
- stack() - Convert the above DataFrame into a Series with column names as the second MultiIndex level.
droplevel(1) - Drop the unncessary index level.
reset_index() - Change Privileges (the index column) to a regular column.
fillna('') - Change each NaN into an empty string.

The result is:

   Privileges Super_Admin Admin
0       Type1          A1    A1
1       Type1          A2    A2
2       Type1          A3      
3       Type1          A4      
4       Type2          A1    A1
5       Type2          B1    B1
6       Type2          B2    B2
7       Type2          B3      
8       Type2          A4      
9       Type3          C1    B2
10      Type3          B2    C1
11      Type3          C2    C2
12      Type3          C3

Splitting a cell in pandas into multiple rows

3 Answers3