I have a number of CSV files with columns such as gender, age, diagnosis, etc.
Currently, the data is in long format, with an ID repeated once per diagnosis:
ID, gender, age, diagnosis
1, male, 42, asthma
1, male, 42, anxiety
2, male, 19, asthma
3, female, 23, diabetes
4, female, 61, diabetes
4, female, 61, copd
The goal is to transform this data into one row per ID, with a 0/1 indicator column for each distinct value. (Side note: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female'.)
The target format:
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0
2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0
3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0
4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
I've attempted using reshape2's dcast() function, but it crosses the variables, producing one column per combination of values and an extremely sparse matrix. Here's a simplified example with just age and gender:
data.train <- dcast(data.raw, formula = ID ~ gender + age, fun.aggregate = length)
ID, male19, male23, male42, male61, female19, female23, female42, female61
1, 0, 0, 1, 0, 0, 0, 0, 0
2, 1, 0, 0, 0, 0, 0, 0, 0
3, 0, 0, 0, 0, 0, 1, 0, 0
4, 0, 0, 0, 0, 0, 0, 0, 1
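The crossing appears to come from putting gender and age on the right-hand side of the same formula. Here is a minimal sketch of one possible workaround, assuming reshape2 is installed and that data.raw looks like the sample at the top (it is rebuilt by hand here so the snippet runs on its own): melt to long form first, then cast on the variable/value pair. The names data.long and data.wide are only illustrative.

```r
library(reshape2)

# Sample data from the top of the post, rebuilt by hand (assumed structure of data.raw)
data.raw <- data.frame(
  ID = c(1, 1, 2, 3, 4, 4),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42, 42, 19, 23, 61, 61),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Long form: one row per (ID, variable, value); values are coerced to character
data.long <- melt(data.raw, id.vars = "ID")

# One column per variable/value pair; 1 if the ID has that value at all, else 0
data.wide <- dcast(
  data.long,
  ID ~ variable + value,
  value.var = "value",
  fun.aggregate = function(x) as.integer(length(x) > 0)
)
```

With this formula the columns should come out named like gender_male and age_42, which would also cover the side note about prefixes.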
Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.
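A rough base R sketch of the target transformation, in case avoiding extra packages is acceptable. It assumes the data is small enough to tabulate in memory and that every non-ID column should be expanded; the helper name one_hot_by_id is hypothetical, and the sample data is rebuilt by hand so the snippet is self-contained.

```r
# Sample data from the top of the post, rebuilt by hand (assumed structure of data.raw)
data.raw <- data.frame(
  ID = c(1, 1, 2, 3, 4, 4),
  gender = c("male", "male", "male", "female", "female", "female"),
  age = c(42, 42, 19, 23, 61, 61),
  diagnosis = c("asthma", "anxiety", "asthma", "diabetes", "diabetes", "copd"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per distinct value of each non-ID column
one_hot_by_id <- function(df, id = "ID") {
  value_cols <- setdiff(names(df), id)
  blocks <- lapply(value_cols, function(col) {
    counts <- table(df[[id]], df[[col]])   # rows = IDs, columns = distinct values
    matrix(
      as.integer(counts > 0),              # collapse repeated rows to a 0/1 flag
      nrow = nrow(counts),
      dimnames = list(rownames(counts),
                      paste(col, colnames(counts), sep = "_"))
    )
  })
  wide <- do.call(cbind, blocks)           # all blocks share the same ID row order
  data.frame(
    ID = type.convert(rownames(wide), as.is = TRUE),
    wide,
    check.names = FALSE,
    row.names = NULL
  )
}

result <- one_hot_by_id(data.raw)
```

The columns within each block follow the sorted order that table() uses, so they will not match the hand-written target order exactly, but each column carries its original column name as a prefix.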