Simply put, I need to convert the following input dataframe to the output below.
After a few hours struggling to figure out how, by combining multiple previous stackoverflow questions, I could transform dataframe, but it takes so much time for large dataframe to convert since I use pivot and apply method.
import numpy as np
import pandas as pd
df = pd.DataFrame({"id":[1,2,3,4,5],
"day":pd.Timestamp('20190529'),
"subject":"math,english,economics",
"score":pd.Categorical(["68,62,49","58,72,87","28,32,46","48,72,66","46,25,93"]),
"Department":pd.Categorical(["Economics","Computer Science","Sociology","Business","Math"])})
---Input DataFrame---
id day subject score Department
0 1 2019-05-29 math,english,economics 68,62,49 Economics
1 2 2019-05-29 math,economics 58,87 Computer Science
2 3 2019-05-29 philosophy,english,business 28,32,46 Sociology
3 4 2019-05-29 physics,sociology 72,66 Business
4 5 2019-05-29 Math 93 Math
And output is like the one below
---Output DataFrame---
id day Department Math business economics english math philosophy physics sociology
1 2019-05-29 Economics NaN NaN 49 62 68 NaN NaN NaN
2 2019-05-29 Computer Science NaN NaN 87 NaN 58 NaN NaN NaN
3 2019-05-29 Sociology NaN 46 NaN 32 NaN 28 NaN NaN
4 2019-05-29 Business NaN NaN NaN NaN NaN NaN 72 66
5 2019-05-29 Math 93 NaN NaN NaN NaN NaN NaN NaN
And my approach is
- I split subject and score columns by ','.
- Explode the list of elements in subject and score columns to separate row as pandas.Series
- Join each pandas.Series to make new dataframe
- Pivot new dataframe created in step 3
- Drop subject and score column in original dataframe
- Join each dataframe made in step 4 and step 5
And my code is as follows
df["subject"] = df["subject"].str.split(",")
df["score"] = df["score"].str.split(",")
subject = df.apply(lambda x: pd.Series(x['subject']),axis=1).stack().reset_index(level=1, drop=True)
score = df.apply(lambda x: pd.Series(x['score']),axis=1).stack().reset_index(level=1, drop=True)
subject.name = 'subject'
score.name = 'score'
subject_score = pd.concat([subject, score],join='outer', axis=1)
pdf = df.drop('subject', axis=1).drop("score", axis=1).join(subject_score)
pivot = pdf.pivot(columns="subject",values="score")
concate_table = df.drop("subject",axis = 1).drop("score", axis=1)
output = concate_table.join(pivot)
I just started learning pandas recently and I am sure this wouldn't be best way to column transposition.
If you could give me some advice how to optimize this code, I appreciate.
Thank you in advance.