2

Simply put, I need to convert the following input dataframe to the output below.

After a few hours struggling to figure out how, by combining multiple previous stackoverflow questions, I could transform dataframe, but it takes so much time for large dataframe to convert since I use pivot and apply method.

import numpy as np
import pandas as pd
df = pd.DataFrame({"id":[1,2,3,4,5],
                  "day":pd.Timestamp('20190529'),
                  "subject":"math,english,economics",
                  "score":pd.Categorical(["68,62,49","58,72,87","28,32,46","48,72,66","46,25,93"]),
                  "Department":pd.Categorical(["Economics","Computer Science","Sociology","Business","Math"])})


---Input DataFrame---

    id    day           subject                score       Department
0   1   2019-05-29  math,english,economics    68,62,49    Economics
1   2   2019-05-29  math,economics             58,87       Computer Science
2   3   2019-05-29  philosophy,english,business 28,32,46    Sociology
3   4   2019-05-29  physics,sociology           72,66      Business
4   5   2019-05-29  Math                          93          Math

And output is like the one below

---Output DataFrame---

id  day         Department          Math business economics english math philosophy physics sociology
1   2019-05-29  Economics           NaN NaN 49  62  68  NaN NaN NaN
2   2019-05-29  Computer Science    NaN NaN 87  NaN 58  NaN NaN NaN
3   2019-05-29  Sociology           NaN 46  NaN 32  NaN 28  NaN NaN
4   2019-05-29  Business            NaN NaN NaN NaN NaN NaN 72  66
5   2019-05-29  Math                93  NaN NaN NaN NaN NaN NaN NaN

And my approach is

  1. I split subject and score columns by ','.
  2. Explode the list of elements in subject and score columns to separate row as pandas.Series
  3. Join each pandas.Series to make new dataframe
  4. Pivot new dataframe created in step 3
  5. Drop subject and score column in original dataframe
  6. Join each dataframe made in step 4 and step 5

And my code is as follows

df["subject"] = df["subject"].str.split(",")
df["score"] = df["score"].str.split(",")

subject = df.apply(lambda x: pd.Series(x['subject']),axis=1).stack().reset_index(level=1, drop=True)
score = df.apply(lambda x: pd.Series(x['score']),axis=1).stack().reset_index(level=1, drop=True)
subject.name = 'subject'
score.name = 'score'

subject_score = pd.concat([subject, score],join='outer', axis=1)
pdf = df.drop('subject', axis=1).drop("score", axis=1).join(subject_score)

pivot = pdf.pivot(columns="subject",values="score")
concate_table = df.drop("subject",axis = 1).drop("score", axis=1)
output = concate_table.join(pivot)

I just started learning pandas recently and I am sure this wouldn't be best way to column transposition.

If you could give me some advice how to optimize this code, I appreciate.

Thank you in advance.

Micro_Andy
  • 93
  • 1
  • 9

2 Answers2

1

I would define a custom function stack_str to unpack string column to dataframe using expand=True and stack and reset_index to a series.

Apply stack_str to 2 columns of strings to make df1 of 2 columns.

Next, do pivot on df1 to make subject values as columns and scores as values. Finally, join back to df already dropped 2 columns having strings.

def stack_str(x):
        s = x.str.split(',', expand=True).stack().reset_index(level=-1, drop=True)
        return s

df1 = df[['subject', 'score']].apply(stack_list)

Out[984]:
      subject score
0        math    68
0     english    62
0   economics    49
1        math    58
1   economics    87
2  philosophy    28
2     english    32
2    business    46
3     physics    72
3   sociology    66
4        Math    93

df2 = df.drop(['subject', 'score'], axis=1).join(df1.pivot(columns='subject', values='score'))

Out[986]:
   id         day        Department Math business economics english math  \
0   1  2019-05-29         Economics  NaN      NaN        49      62   68
1   2  2019-05-29  Computer_Science  NaN      NaN        87     NaN   58
2   3  2019-05-29         Sociology  NaN       46       NaN      32  NaN
3   4  2019-05-29          Business  NaN      NaN       NaN     NaN  NaN
4   5  2019-05-29              Math   93      NaN       NaN     NaN  NaN

  philosophy physics sociology
0        NaN     NaN       NaN
1        NaN     NaN       NaN
2         28     NaN       NaN
3        NaN      72        66
4        NaN     NaN       NaN
Andy L.
  • 24,909
  • 4
  • 17
  • 29
0

I am using unnesting then with pivot and concat back

s=df[['subject','score']]
s=unnesting(s.apply(lambda x :x.str.split(','),1),['subject','score']).pivot(columns='subject',values='score')
df=pd.concat([df,s],axis=1)

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(explode, 1), how='left')
BENY
  • 317,841
  • 20
  • 164
  • 234