In a fictional patients dataset one might encounter the following table:
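import pandas as pd

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greyscale & Cooties"]
})
This renders as the following dataset:
   Patients              Disease
0      Luke              Cooties
1     Nigel           Dragon Pox
2     Sarah  Greyscale & Cooties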
Now, assuming that rows with multiple illnesses always use the same pattern (separation with a character, in this context &) and that a complete list of the illnesses, named diseases, exists, I've yet to find a simple way to apply the pandas.get_dummies one-hot encoder to these situations and obtain a binary vector for each patient.
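To make the problem concrete, here is a minimal sketch of the naive attempt. The variable names df and naive are only illustrative. Calling pandas.get_dummies directly on the column treats each distinct string as its own category, so the combined entry is never split into its two illnesses:

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greyscale & Cooties"],
})

# Naive attempt: one column per distinct string in "Disease".
# "Greyscale & Cooties" becomes a single category of its own,
# so Sarah is not marked as having Cooties.
naive = pd.get_dummies(df["Disease"])
print(list(naive.columns))
```

The resulting columns are the three raw strings, including the combined one, which is not the per-illness vector I am after.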
How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?
pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties": [1, 0, 1],
    "Dragon Pox": [0, 1, 0],
    "Greyscale": [0, 0, 1]
})
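One direction that might work, sketched here under the assumptions that the separator is always exactly " & " (with the surrounding spaces) and that diseases holds every illness name, is Series.str.get_dummies, which splits each string on a separator before encoding. The names df, diseases, dummies and result are illustrative, and this may well not be the simplest option:

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greyscale & Cooties"],
})

# Assumed: the complete list of illnesses mentioned in the question.
diseases = ["Cooties", "Dragon Pox", "Greyscale"]

# Split each entry on " & " and build one 0/1 column per illness found.
dummies = df["Disease"].str.get_dummies(sep=" & ")

# Align to the full list so illnesses absent from the data still get
# an all-zero column, and the column order is predictable.
dummies = dummies.reindex(columns=diseases, fill_value=0)

# Put the patient names back next to the binary columns.
result = pd.concat([df[["Patients"]], dummies], axis=1)
print(result)
```

If the separator spacing is inconsistent in the real data, the column would presumably need normalising first, for example by stripping whitespace around the & before encoding.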