4

It was a bit tricky to explain the problem. I want to split a cell containing multiple string values delimited by commas into different rows. The df below is a small example but the real dataset contains up to 15 columns and 15 rows and each cell has 5 to 6 non unique string values in it that I need to separate into different rows.

How can we split the original df to the transformed df?

Original df

import pandas as pd
df = pd.DataFrame({"Privileges":['Type1','Type2','Type3'],"Super_Admin":["A1,A2,A3,A4","A1,B1,B2,B3, A4","C1,B2,C2,C3"], "Admin":["A1,A2","A1,B1,B2","B2, C1,C2"])
Index Privileges Super_Admin Admin
0 Type1 A1,A2,A3,A4 A1,A2
1 Type2 A1,B1,B2,B3, A4 A1,B1,B2
2 Type3 C1,B2,C2,C3 B2, C1,C2

Transformed df

df = pd.DataFrame({"Privileges":['Type1','Type1','Type1','Type1','Type2','Type2','Type2','Type2','Type2','Type3','Type3','Type3','Type3'],"Super_Admin":["A1","A2","A3","A4","A1", "B1","B2","B3", "A4","C1","B2","C2","C3"], "Admin":["A1","A2",'', '',"A1","B1","B2",'', '', "B2", "C1","C2", '']})
Index Privileges Super_Admin Admin
0 Type1 A1 A1
1 Type1 A2 A2
2 Type1 A3 --
3 Type1 A4 --
4 Type2 A1 A1
5 Type2 B1 B1
6 Type2 B2 B2
7 Type2 B3 --
8 Type2 A4 --
9 Type3 C1 B2
10 Type3 B2 C1
11 Type3 C2 C2
12 Type3 C3 --
user14318465
  • 41
  • 1
  • 5
  • 1
    Thank You for sharing the dataframe code as well. Most people just share the picture or text dataframe. You have the full code. It helps us to quickly work on the solution. – Joe Ferndz Mar 30 '21 at 06:55
  • Should Type 3 have Super_Admin and Admin match the values. For ex: it has C1 B2 and B2 C1. Should it be C1 C1 and B2 B2 instead? – Joe Ferndz Mar 30 '21 at 07:52
  • See the post from a few years ago on [exploding multiple columns...](https://stackoverflow.com/questions/45846765/efficient-way-to-unnest-explode-multiple-list-columns-in-a-pandas-dataframe) – Joe Ferndz Mar 30 '21 at 07:54

3 Answers3

4

Breakdown of the steps below:

zip all the columns

Use zip_longest to pair None with values that do not have a pair

Combine the two lists into one, with chain

Create dataframe and forward fill on the Privileges column

In [541]: step1 = zip(df.Privileges, df.Super_Admin, df.Admin)

In [542]: step2 = (zip_longest([first], second,last) 
                   for first, second, last 
                   in step1)

In [543]: step3 = chain.from_iterable(step2)

In [546]: (pd.DataFrame(step3, columns = df.columns)
             .assign(Privileges = lambda df: df.Privileges.ffill())
           )
Out[546]: 
   Privileges Super_Admin Admin
0       Type1          A1    A1
1       Type1          A2    A2
2       Type1          A3  None
3       Type1          A4  None
4       Type2          A1    A1
5       Type2          B1    B1
6       Type2          B2    B2
7       Type2          B3  None
8       Type2          A4  None
9       Type3          C1    B2
10      Type3          B2    C1
11      Type3          C2    C2
12      Type3          C3  None

To get more speed, you can move the split step into native python territory. pandas string methods are wrappers around python's string functions, as such they are not as fast as python's string functions.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Too fast :) I just got started and you already have it. Mine would have been too complex. This looks neat. upvoted. – Joe Ferndz Mar 30 '21 at 06:56
2

This is on option. First make Super_Admin and Admin to a list. This is useful to use pd.explode().

df['Super_Admin'] = df['Super_Admin'].apply(lambda x: x.split(','))
df['Admin'] = df['Admin'].apply(lambda x: x.split(','))

Then call explode on both columns and fill the missing values with an empty string.

a = df.explode('Super_Admin')
b = df.explode('Admin')
for i in range(3):
    short = b.loc[i,'Admin'].values
    long = a.loc[i,'Admin'].values
    a.loc[i,'Admin'] = np.concatenate((short, ['']*(len(long)-len(short))), axis=0)

The output looks like this:

>>> a
  Privileges Super_Admin Admin
0      Type1          A1    A1
0      Type1          A2    A2
0      Type1          A3      
0      Type1          A4      
1      Type2          A1    A1
1      Type2          B1    B1
1      Type2          B2    B2
1      Type2          B3      
1      Type2          A4      
2      Type3          C1    B2
2      Type3          B2    C1
2      Type3          C2    C2
2      Type3          C3      
mosc9575
  • 5,618
  • 2
  • 9
  • 32
2

You can do the job with a single although substantially chained instruction:

result = df.set_index('Privileges').apply(lambda col:
    col.str.split(', ?', expand=True).stack())\
    .droplevel(1).reset_index().fillna('')

Steps:

  • df.set_index('Privileges') - Set Privileges as the index column.
  • apply(lambda col: - Apply to each column:
    • col.str.split(', ?', expand=True) - Break this column into separate columns (a DataFrame, with column names as consecutive integers).
    • stack() - Convert the above DataFrame into a Series with column names as the second MultiIndex level.
  • droplevel(1) - Drop the unncessary index level.
  • reset_index() - Change Privileges (the index column) to a regular column.
  • fillna('') - Change each NaN into an empty string.

The result is:

   Privileges Super_Admin Admin
0       Type1          A1    A1
1       Type1          A2    A2
2       Type1          A3      
3       Type1          A4      
4       Type2          A1    A1
5       Type2          B1    B1
6       Type2          B2    B2
7       Type2          B3      
8       Type2          A4      
9       Type3          C1    B2
10      Type3          B2    C1
11      Type3          C2    C2
12      Type3          C3      
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41