Pandas split corresponding rows based on separator in two columns duplicating everything else

Question

I have an excel sheet

Col1    Col2                          Col3            Col4
John    English\nMaths                34\n33          Pass
Sam     Science                       40              Pass
Jack    English\nHistory\nGeography   89\n07\n98      Pass

Need to convert it to

Col1    Col2      Col3    Col4
John    English   34      Pass
John    Maths     33      Pass
Sam     Science   40      Pass
Jack    English   89      Pass
Jack    History   07      Pass     
Jack    Geography 98      Pass

The excel sheet has \n as separator for corresponding Col2 and col3 column. Just need to pull each subject in a new row with its corresponding marks and copy all the other column contents as it is.

Tried

split_cols = ['Col2', 'Col3']

# loop over the columns and split them
separator = '\n'
for col in split_cols:
    df[[f'{col}_Split1', f'{col}_Split2']] = df[col].str.split(separator, n=1, expand=True).fillna('')

# create two new dataframes with the desired columns
df1 = df[['Col1', 'Col2_Split1', 'Col3_Split1', 'Col4']].rename(columns={'Col2_Split1': 'D', 'Col3_Split1': 'C'})
df2 = df[['Col1', 'Col2_Split2', 'Col3_Split2', 'Col4']].rename(columns={'Col2_Split2': 'D', 'Col3_Split2': 'C'})

# concatenate the two dataframes
final_df = pd.concat([df1, df2], ignore_index=True)

# print the final dataframe
print(final_df)

Corralien · Accepted Answer · 2023-03-01T06:46:44.297

3

You can explode on multiple columns (with a recent version of Pandas >= 1.3) after exploding each string into list:

# First pass
out = (df.assign(Col2=df['Col2'].str.split('\n'), 
                 Col3=df['Col3'].str.split('\n')))

# Fix unbalanced lists
def pad(sr):
    n = max(sr.str.len())
    sr['Col2'] = np.pad(sr['Col2'], (0, n-len(sr['Col2'])))
    sr['Col3'] = np.pad(sr['Col3'], (0, n-len(sr['Col3'])))
    return sr

m = out['Col2'].str.len() != out['Col3'].str.len()
out.loc[m, ['Col2', 'Col3']] = out.loc[m, ['Col2', 'Col3']].apply(pad, axis=1)

# Second pass
out = out.explode(['Col2', 'Col3'], ignore_index=True)
print(out)

# Output
   Col1       Col2 Col3    Col4
0  John    English   34    Pass
1  John      Maths   33    Pass
2   Sam    Science   40    Pass
3  Jack    English   89    Pass
4  Jack    History   07    Pass
5  Jack  Geography   98    Pass
6  Ryan      Maths   12  Failed
7  Ryan    Science   10  Failed
8  Ryan    History    0  Failed

Input dataframe:

import pandas as pd
import numpy as np

data = {'Col1': ['John', 'Sam', 'Jack', 'Ryan'],
        'Col2': ['English\nMaths', 'Science', 'English\nHistory\nGeography', 'Maths\nScience\nHistory'],
        'Col3': ['34\n33', '40', '89\n07\n98', '12\n10'],
        'Col4': ['Pass', 'Pass', 'Pass', 'Failed']}
df = pd.DataFrame(data)
print(df)

# Output
   Col1                         Col2        Col3    Col4
0  John               English\nMaths      34\n33    Pass
1   Sam                      Science          40    Pass
2  Jack  English\nHistory\nGeography  89\n07\n98    Pass
3  Ryan      Maths\nScience\nHistory      12\n10  Failed

edited Mar 01 '23 at 06:46

answered Mar 01 '23 at 06:03

Corralien

109,409
8
28
52

```ValueError: columns must have matching element counts``` – spd Mar 01 '23 at 06:11
2

You have unbalanced lists between Col2 and Col3. You have some rows where Col2 has more or less items of Col3. `A\nB\nC` <-> `1\n2` so C has no value. – Corralien Mar 01 '23 at 06:13
Is there a way to handle such cases – spd Mar 01 '23 at 06:17
Probably a slow way – Corralien Mar 01 '23 at 06:18
Please Suggest, there shouldn't me much cases [expecting 1 or 2] where unbalanced lists would be encountered. – spd Mar 01 '23 at 06:21
With unbalanced lists, you can modify the code in this [answer](https://stackoverflow.com/a/74044665/18470692). For the last line, as suggested there in the comment underneath, you probably want to use: `result = df.drop(columns=columns).join(exploded)` – ouroboros1 Mar 01 '23 at 06:25
I updated my answer to take into account unbalanced lists. Can you check it please? – Corralien Mar 01 '23 at 06:47
Yes the change worked, Thank You Very Much – spd Mar 01 '23 at 07:24

EyuelDK · Answer 2 · 2023-03-01T06:11:08.757

1

EDITED.

You can achieve your goals using .str.split + .explode methods.

import pandas

df = pandas.DataFrame([
  ["John", "English\nMaths", "34\n33", "Pass"],
  ["Sam", "Science", "40", "Pass"],
  ["Jack", "English\nHistory\nGeography", "89\n07\n98", "Pass"],
])

df[1] = df[1].str.split("\n")
df[2] = df[2].str.split("\n")
df = df.explode([1, 2])
print(df)

edited Mar 01 '23 at 06:11

answered Mar 01 '23 at 05:38

EyuelDK

3,029
2
19
27

The issue is not related to NA values. The column values are separated with '\n' , i need to split those. Please check the edited question. – spd Mar 01 '23 at 05:46
@spd Can you give the actual code + data that you used to create the example. – EyuelDK Mar 01 '23 at 05:49

Pandas split corresponding rows based on separator in two columns duplicating everything else

2 Answers2