Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)

Question

I have a number of CSV files with columns such as gender, age, diagnosis, etc.

Currently, they are coded as such:

ID, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd

The goal is to transform this data into this target format:

Sidenote: if possible, it would be great to also prepend the original column names to the new column names, e.g. 'age_42' or 'gender_female.'

ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd
1,  1,    0,      1,  0,  0,  0,  1,      1,       0,        0
2,  1,    0,      0,  1,  0,  0,  1,      0,       0,        0
3,  0,    1,      0,  0,  1,  0,  0,      0,       1,        0
4,  0,    1,      0,  0,  0,  1,  0,      0,       1,        1

I've attempted using reshape2's dcast() function but am getting combinations resulting in extremely sparse matrices. Here's a simplified example with just age and gender:

data.train  <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length)

ID, male19, male23, male42, male61, female19, female23, female42, female61
1,  0,      0,      1,      0,      0,        0,        0,        0
2,  1,      0,      0,      0,      0,        0,        0,        0
3,  0,      0,      0,      0,      0,        1,        0,        0
4,  0,      0,      0,      0,      0,        0,        0,        1

Seeing as this is a fairly common task in machine learning data preparation, I imagine there may be other libraries (that I'm unaware of) that are able to perform this transformation.

David Arenburg · Answer 1 · 2015-05-16T22:16:12.857

You need a melt/dcast combination here (which called recast) in order to convert all columns into one column and avoid combinations

library(reshape2)
recast(df, ID ~ value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID 19 23 42 61 anxiety asthma copd diabetes female male
# 1  1  0  0  1  0       1      1    0        0      0    1
# 2  2  1  0  0  0       0      1    0        0      0    1
# 3  3  0  1  0  0       0      0    0        1      1    0
# 4  4  0  0  0  1       0      0    1        1      1    0

As per your Sidenote, you can add variable here in order to get the names added too

recast(df, ID ~ variable + value, id.var = 1, fun.aggregate = function(x) (length(x) > 0) + 0L)
#   ID gender_female gender_male age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma diagnosis_copd
# 1  1             0           1      0      0      1      0                 1                1              0
# 2  2             0           1      1      0      0      0                 0                1              0
# 3  3             1           0      0      1      0      0                 0                0              0
# 4  4             1           0      0      0      0      1                 0                0              1
#   diagnosis_diabetes
# 1                  0
# 2                  0
# 3                  1
# 4                  1

score 5 · Answer 2 · answered May 16 '15 at 23:55

5

There is a function in the caret package to "dummify" data.

library(caret)
library(dplyr)
predict(dummyVars(~ ., data = mutate_each(df, funs(as.factor))), newdata = df)

answered May 16 '15 at 23:55

Steven Beaupré

21,343
7
57
77

akrun · Accepted Answer · 2015-05-17T08:11:54.573

A base R option would be

 (!!table(cbind(df1[1],stack(df1[-1])[-2])))*1L
 #     values
 #ID  19 23 42 61 anxiety asthma copd diabetes female male
 # 1  0  0  1  0       1      1    0        0      0    1
 # 2  1  0  0  0       0      1    0        0      0    1
 # 3  0  1  0  0       0      0    0        1      1    0
 # 4  0  0  0  1       0      0    1        1      1    0

If you need the original name as well

 (!!table(cbind(df1[1],Val=do.call(paste, c(stack(df1[-1])[2:1], sep="_")))))*1L
 #   Val
 #ID  age_19 age_23 age_42 age_61 diagnosis_anxiety diagnosis_asthma
 #1      0      0      1      0                 1                1
 #2      1      0      0      0                 0                1
 #3      0      1      0      0                 0                0
 #4      0      0      0      1                 0                0
 #  Val
 #ID  diagnosis_copd diagnosis_diabetes gender_female gender_male
 #1              0                  0             0           1
 #2              0                  0             0           1
 #3              0                  1             1           0
 #4              1                  1             1           0

data

df1 <- structure(list(ID = c(1L, 1L, 2L, 3L, 4L, 4L), gender = c("male", 
"male", "male", "female", "female", "female"), age = c(42L, 42L, 
19L, 23L, 61L, 61L), diagnosis = c("asthma", "anxiety", "asthma", 
"diabetes", "diabetes", "copd")), .Names = c("ID", "gender", 
"age", "diagnosis"), row.names = c(NA, -6L), class = "data.frame")

+1 for a good vanilla implementation; I'm going to give this one a run too. Format isn't quite there yet for my data, but that's a data cleaning, rather than structure, issue. — Greenstick, May 17 '15 at 22:53
Update: Accepting this as the answer; in testing it, it provides the most consistent implementation and output. — Greenstick, May 17 '15 at 23:17

Jota · Answer 4 · 2015-05-17T03:41:40.640

Using reshape from base R:

d <- reshape(df, idvar="ID", timevar="diagnosis", direction="wide", v.names="diagnosis", sep="_")
a <- reshape(df, idvar="ID", timevar="age", direction="wide", v.names="age", sep="_")
g <- reshape(df, idvar="ID", timevar="gender", direction="wide", v.names="gender", sep="_")


new.dat <- cbind(ID=d["ID"],
    g[,grepl("_", names(g))],
    a[,grepl("_", names(a))],
    d[,grepl("_", names(d))])

# convert factors columns to character (if necessary)
# taken from @Marek's answer here: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters/2853231#2853231
new.dat[sapply(new.dat, is.factor)] <- lapply(new.dat[sapply(new.dat, is.factor)], as.character)

new.dat[which(is.na(new.dat), arr.ind=TRUE)] <- 0
new.dat[-1][which(new.dat[-1] != 0, arr.ind=TRUE)] <- 1

#  ID gender_male gender_female age_42 age_19 age_23 age_61 diagnosis_asthma
#1  1           1             0      1      0      0      0                1
#3  2           1             0      0      1      0      0                1
#4  3           0             1      0      0      1      0                0
#5  4           0             1      0      0      0      1                0
#  diagnosis_anxiety diagnosis_diabetes diagnosis_copd
#1                 1                  0              0
#3                 0                  0              0
#4                 0                  1              0
#5                 0                  1              1

score 1 · Answer 5 · edited May 23 '17 at 12:15

Below is a slightly longer way with dcast() and merge(). As gender and age are not unique by ID, a function is created to turn its length into a dummy variable (dum()). On the other hand, diagnosis is set to be counted uniquely by adjusting the formula.

library(reshape2)
data.raw <- read.table(header = T, sep = ",", text = "
id, gender, age, diagnosis
1,  male,   42,  asthma
1,  male,   42,  anxiety
2,  male,   19,  asthma
3,  female, 23,  diabetes
4,  female, 61,  diabetes
4,  female, 61,  copd")

# function to create a dummy variable
dum <- function(x) { if(length(x) > 0) 1 else 0 }

# length of dignosis by id, gender and age
diag <- dcast(data.raw, formula = id + gender + age ~ diagnosis, fun.aggregate = length)[,-c(2,3)]

# length of gender by id
gen <- dcast(data.raw, formula = id ~ gender, fun.aggregate = dum)

# length of age by id
age <- dcast(data.raw, formula = id ~ age, fun.aggregate = dum)

merge(merge(gen, age, by = "id"), diag, by = "id")
#  id   female   male 19 23 42 61   anxiety   asthma   copd   diabetes
#1  1        0      1  0  0  1  0         1        1      0          0
#2  2        0      1  1  0  0  0         0        1      0          0
#3  3        1      0  0  1  0  0         0        0      0          1
#4  4        1      0  0  0  0  1         0        0      1          1

Actually I'm not well aware of your model but your setting may be too much as R handles factors by the formula object. For example, if gender is the response, the following matrix will be generated within R. Therefore, as long as you are not going to fit on your own, it'd suffice to set data types and formula appropriately.

data.raw$age <- as.factor(data.raw$age)
model.matrix(gender ~ ., data = data.raw[,-1])
#(Intercept) age23 age42 age61 diagnosis  asthma diagnosis  copd diagnosis  diabetes
#1           1     0     1     0                 1               0                   0
#2           1     0     1     0                 0               0                   0
#3           1     0     0     0                 1               0                   0
#4           1     1     0     0                 0               0                   1
#5           1     0     0     1                 0               0                   1
#6           1     0     0     1                 0               1                   0

If you need all the levels of each variable you can do this by suppressing the intercept in model.matrix and using a wee trick from all-levels-of-a-factor-in-a-model-matrix-in-r

#  Using Akrun's df1, first change all variables, except ID, to factor
df1[-1] <- lapply(df1[-1], factor)

# Use model.matrix to derive dummy coding
m <- data.frame(model.matrix( ~ 0 + . , data=df1, 
             contrasts.arg = lapply(df1[-1], contrasts, contrasts=FALSE)))

# Collapse to give final solution
aggregate(. ~ ID, data=m, max)

Hi Jaehyeon ... I didn't think it was worth adding this as a separate answer and seems to fit here. Please roll back the edit if you dont want it. — user20650, May 17 '15 at 14:14

Converting Column Values into Their Own Binary Encoded Columns (Dummy Variables)

5 Answers5

data