3

I have a pandas dataframe that looks like this:

SUBJECT                 STUDENT         CITY        STATE

Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles CA
Geology/Physics         John            Boston      MA

This is how it should look like:

SUBJECT      STUDENT    CITY           STATE

Math         Sam        Los Angeles    CA
Chemistry    Peter      Los Angeles    CA
Biology      Mary       Los Angeles    CA
Geology      John       Boston         MA
Physics      John       Boston         MA

Before asking this question, I referred to the solutions mentioned in this page: pandas: How do I split text in a column into multiple rows?

Since there are slash separated strings in two columns, I am not able to use the solutions in the above link.

Community
  • 1
  • 1

3 Answers3

3

Another solution with concat and join:

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics

print (s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT'))
        .ffill()
        .reset_index(level=1, drop=True)
print (df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John

df = df.drop(['SUBJECT','STUDENT'], axis=1)
       .join(df1)
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

First thing, split fields by '/'

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

Then I use a function to explode rows. However, I had to segregate those rows that only had one student or subject.

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)


cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

Finally

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

enter image description here


Timing

piRSquared

%%timeit

df = df_.copy()

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

100 loops, best of 3: 7.76 ms per loop

jezrael

%%timeit

df = df_.copy()

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)

df.drop(['SUBJECT','STUDENT'], axis=1) \
  .join(df1) \
  .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]

100 loops, best of 3: 5.13 ms per loop
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

Try this: Can be modified where SUBJECT equals 1 and zip is then used.

df3.SUBJECT = df3.SUBJECT.str.split('/')
df3.STUDENT = df3.STUDENT.str.split('/')

def splitter(gb):
    ll = []
    subs, stus = gb.SUBJECT.values[0], gb.STUDENT.values[0]

    if   len(stus) == len(subs): ll = zip(subs,stus)
    elif len(stus) == 1:         ll = zip(subs,stus*len(subs))
    return pd.DataFrame(ll, columns= (["SUBJECT","STUDENT"])) 

df = df3.groupby(['CITY','STATE'])['SUBJECT','STUDENT'].apply(splitter).reset_index().drop('level_2', axis =1)
print df[[ 'SUBJECT', 'STUDENT', 'CITY','STATE' ]]


   SUBJECT STUDENT        CITY STATE
0    Geology    John      Boston    MA
1    Physics    John      Boston    MA
2       Math     Sam  LosAngeles    CA
3  Chemistry   Peter  LosAngeles    CA
4    Biology    Mary  LosAngeles    CA
Merlin
  • 24,552
  • 41
  • 131
  • 206