Python: One hot encoding using reference list

Question

I have data something like below:

CANDIDATE_ID	Job1_Skill1
12	conflict management
13	asset management

I want to add one hot encoded columns for each skill in table python and pandas based on the reference skill set(list). for example if reference skill set given is [conflict management, asset management, .net] then my output should be something like below:

CANDIDATE_ID	Job1_Skill1	FP_conflict management	FP_ asset management	FP_.net
12	conflict management	1	0	0
13	asset management	0	1	0

I could do it comparing row by row but it does not seem to be an efficient approach. Can anyone suggest efficient way to do this using python?

get_dummies method gives output based on values in same column but I need to compare values for a specific reference list to encode i.e. get_dummies can give encoding only for FP_Conflict_management and FP_asset_management and not for FP_.net and also get_dummies will be dynamic for each dataframe. I need to encode based on specific list of skills for every dataframe but I need to compare the values with different column for encoding hence it cannot be used.

Does this answer your question? [How can I one hot encode in Python?](https://stackoverflow.com/questions/37292872/how-can-i-one-hot-encode-in-python) — Tomerikoo, Jan 13 '21 at 12:44
If your question is resolved, please accept the answer so it gets marked as answered. — Ilya Berdichevsky, Mar 04 '21 at 18:59

score 0 · Answer 1 · answered Jan 13 '21 at 16:37

Here is simple workaround by adding reference list to the source data dataframe.

# set up source data
df_data = pd.DataFrame([[12,'conflict'],[13,'asset']],columns=['CANDIDATE_ID','Job1_Skill1'])

# define reference list with some unique ids
skills = [[999, 'conflict'],[999, 'asset'],[999, '.net']]
df_skills = pd.DataFrame(skills,columns=['CANDIDATE_ID','Job1_Skill1'])

# add reference data to main df
df_data_with_skills = df_data.append(df_skills, ignore_index=True) 

# encode with pd.get_dummies
skills_dummies = pd.get_dummies(df_data_with_skills.Job1_Skill1)
result = pd.concat([df_data_with_skills, skills_dummies], axis=1)

# remove reference rows
result.drop(result[result['CANDIDATE_ID'] == 999].index, inplace = True) 
print(result)

what if there are multiple columns. I still want single one hot encoded column per skill. like df_data = pd.DataFrame([[12,'conflict','.net'],[13,'asset','python']],columns=['CANDIDATE_ID','Job1_Skill1','job1_skill2']) I still want the same structure as your output. — Vishal Bapat, Jan 28 '21 at 12:07

Python: One hot encoding using reference list

1 Answers1