Python Pandas: Split slash separated strings in two or more columns into multiple rows

Question

I have a pandas dataframe that looks like this:

SUBJECT                 STUDENT         CITY        STATE

Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles CA
Geology/Physics         John            Boston      MA

This is how it should look like:

SUBJECT      STUDENT    CITY           STATE

Math         Sam        Los Angeles    CA
Chemistry    Peter      Los Angeles    CA
Biology      Mary       Los Angeles    CA
Geology      John       Boston         MA
Physics      John       Boston         MA

Before asking this question, I referred to the solutions mentioned in this page: pandas: How do I split text in a column into multiple rows?

Since there are slash separated strings in two columns, I am not able to use the solutions in the above link.

score 3 · Accepted Answer · answered Aug 02 '16 at 06:36

Another solution with concat and join:

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics

print (s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT'))
        .ffill()
        .reset_index(level=1, drop=True)
print (df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John

df = df.drop(['SUBJECT','STUDENT'], axis=1)
       .join(df1)
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA

piRSquared · Answer 2 · 2016-08-02T06:47:38.087

First thing, split fields by '/'

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

Then I use a function to explode rows. However, I had to segregate those rows that only had one student or subject.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)


cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

Finally

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

Timing

piRSquared

%%timeit

df = df_.copy()

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

100 loops, best of 3: 7.76 ms per loop

jezrael

%%timeit

df = df_.copy()

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)

df.drop(['SUBJECT','STUDENT'], axis=1) \
  .join(df1) \
  .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]

100 loops, best of 3: 5.13 ms per loop

I see some numpy function - now it is faster as stack-join method? — jezrael, Aug 02 '16 at 06:10
@jezrael Yours is better. I'm using a general purpose function to explode stuff. I think I'm going to tweak it. I really like how you did it. You gain some efficiency by `expand=True` at the split step. I'll end up moving the split into my function and doing that too. — piRSquared, Aug 02 '16 at 06:50
Hmmm, I think if columns have same items in eaxh row, your solution can be faster. Thank you for timings. — jezrael, Aug 02 '16 at 06:51
@jezrael Yes, very likely. I had to force the use of my function a bit. — piRSquared, Aug 02 '16 at 06:54
I think if use numpy stuff it can be doog approach for improvements. — jezrael, Aug 02 '16 at 06:56

Merlin · Answer 3 · 2016-08-02T02:14:10.537

Try this: Can be modified where SUBJECT equals 1 and zip is then used.

df3.SUBJECT = df3.SUBJECT.str.split('/')
df3.STUDENT = df3.STUDENT.str.split('/')

def splitter(gb):
    ll = []
    subs, stus = gb.SUBJECT.values[0], gb.STUDENT.values[0]

    if   len(stus) == len(subs): ll = zip(subs,stus)
    elif len(stus) == 1:         ll = zip(subs,stus*len(subs))
    return pd.DataFrame(ll, columns= (["SUBJECT","STUDENT"])) 

df = df3.groupby(['CITY','STATE'])['SUBJECT','STUDENT'].apply(splitter).reset_index().drop('level_2', axis =1)
print df[[ 'SUBJECT', 'STUDENT', 'CITY','STATE' ]]


   SUBJECT STUDENT        CITY STATE
0    Geology    John      Boston    MA
1    Physics    John      Boston    MA
2       Math     Sam  LosAngeles    CA
3  Chemistry   Peter  LosAngeles    CA
4    Biology    Mary  LosAngeles    CA

Python Pandas: Split slash separated strings in two or more columns into multiple rows

3 Answers3

Timing