Let's say we have a list of employees and some other data:
  Employee   Location   Title
0        1  Location1  Title1
1        2  Location2  Title1
2        3  Location3  Title2
3        4  Location1  Title3
4        5  Location1  Title2
I am transposing it into features and labels with 1/0 values. It works, but it takes ages on a dataset of 6k records. The logic: take each value from Location, make it a column, and if the employee's location matches that column put 1, otherwise put 0.
My question: is it possible to optimize the performance somehow? My lack of terminology makes it difficult to find a better solution, but I believe one should exist.
The final output looks like this:
  Employee  Location1  Location2  Location3  Title1  Title2  Title3
0        1          1          0          0       1       0       0
1        2          0          1          0       1       0       0
2        3          0          0          1       0       1       0
3        4          1          0          0       0       0       1
4        5          1          0          0       0       1       0
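For what it's worth, I suspect pandas' built-in get_dummies might be the kind of vectorized tool I'm after. Here is a minimal sketch of what I think it would look like (I haven't verified it scales to 6k records or that the column names come out exactly as above):

import pandas as pd

df = pd.DataFrame({'Employee': ['1', '2', '3', '4', '5'],
                   'Location': ['Location1', 'Location2', 'Location3', 'Location1', 'Location1'],
                   'Title': ['Title1', 'Title1', 'Title2', 'Title3', 'Title2']})

# Encode both categorical columns at once; an empty prefix keeps the raw
# values as column names, and astype(int) gives 1/0 instead of True/False.
dummies = pd.get_dummies(df[['Location', 'Title']], prefix='', prefix_sep='').astype(int)
new_df = pd.concat([df['Employee'], dummies], axis=1)
print(new_df)

Is this the right direction, or is there a better option I'm missing?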
The working code that takes ages to complete:
import pandas as pd

df = pd.DataFrame.from_dict({'Employee': ['1', '2', '3', '4', '5'],
                             'Location': ['Location1', 'Location2', 'Location3', 'Location1', 'Location1'],
                             'Title': ['Title1', 'Title1', 'Title2', 'Title3', 'Title2']
                             })

df_tr = df['Employee']  # temporary employee ids

# transposing the data, which takes ages:
df_newcols = {}
for column in list(df)[1:]:
    newcols = df[column].unique()
    for key in newcols:
        temp_ar = []
        for value in df[column]:
            if key == value:
                temp_ar.append(1)
            else:
                temp_ar.append(0)
        df_newcols[key] = temp_ar

print(df_newcols)

# adding transposed to the temp df
df_temp = pd.DataFrame.from_dict(df_newcols)
# merging with df with employee ids
new_df = pd.concat([df_tr, df_temp], axis=1)
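In case it is relevant for answers, this is roughly how I am timing the loop version. The frame below is a synthetic stand-in I made up for the real 6k-record data, which I can't share:

import time
import numpy as np
import pandas as pd

# made-up categories and row count, just to reproduce the slowdown
rng = np.random.default_rng(0)
big = pd.DataFrame({
    'Employee': [str(i) for i in range(6000)],
    'Location': rng.choice(['Location%d' % i for i in range(1, 51)], size=6000),
    'Title': rng.choice(['Title%d' % i for i in range(1, 31)], size=6000),
})

start = time.perf_counter()
newcols = {}
for column in list(big)[1:]:
    for key in big[column].unique():
        newcols[key] = [1 if value == key else 0 for value in big[column]]
elapsed = time.perf_counter() - start
print('loop-based encoding on %d rows took %.2f s' % (len(big), elapsed))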