2

I have data something like below:

CANDIDATE_ID Job1_Skill1
12 conflict management
13 asset management

I want to add one hot encoded columns for each skill in table python and pandas based on the reference skill set(list). for example if reference skill set given is [conflict management, asset management, .net] then my output should be something like below:

CANDIDATE_ID Job1_Skill1 FP_conflict management FP_ asset management FP_.net
12 conflict management 1 0 0
13 asset management 0 1 0

I could do it comparing row by row but it does not seem to be an efficient approach. Can anyone suggest efficient way to do this using python?

get_dummies method gives output based on values in same column but I need to compare values for a specific reference list to encode i.e. get_dummies can give encoding only for FP_Conflict_management and FP_asset_management and not for FP_.net and also get_dummies will be dynamic for each dataframe. I need to encode based on specific list of skills for every dataframe but I need to compare the values with different column for encoding hence it cannot be used.

1 Answers1

0

Here is simple workaround by adding reference list to the source data dataframe.

# set up source data
df_data = pd.DataFrame([[12,'conflict'],[13,'asset']],columns=['CANDIDATE_ID','Job1_Skill1'])

# define reference list with some unique ids
skills = [[999, 'conflict'],[999, 'asset'],[999, '.net']]
df_skills = pd.DataFrame(skills,columns=['CANDIDATE_ID','Job1_Skill1'])

# add reference data to main df
df_data_with_skills = df_data.append(df_skills, ignore_index=True) 

# encode with pd.get_dummies
skills_dummies = pd.get_dummies(df_data_with_skills.Job1_Skill1)
result = pd.concat([df_data_with_skills, skills_dummies], axis=1)

# remove reference rows
result.drop(result[result['CANDIDATE_ID'] == 999].index, inplace = True) 
print(result)

enter image description here

Ilya Berdichevsky
  • 1,249
  • 10
  • 24
  • what if there are multiple columns. I still want single one hot encoded column per skill. like df_data = pd.DataFrame([[12,'conflict','.net'],[13,'asset','python']],columns=['CANDIDATE_ID','Job1_Skill1','job1_skill2']) I still want the same structure as your output. – Vishal Bapat Jan 28 '21 at 12:07
  • just .net for candidate_i=12 should be 1 and not zero. – Vishal Bapat Jan 28 '21 at 15:27