I am currently trying to convert multiple categorical variables into a binary coding using the tidyr package in R. I have data in the following format:
mydata <- data.frame(type = c("tcp", "udp", "tcp", "tcp", "tcp"),
                     service = c("ftp", "other", "private", "http", "http"),
                     flag = c("SF", "SF", "S0", "SF", "SF"))
I want one binary (0/1) column for every distinct value of type, service and flag.
My first attempt, based on a Stack Overflow post, was the following:
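To make the target concrete, this is roughly the layout I am after, built by hand in base R. It is only a sketch: `one_hot` is a helper name made up for illustration (it is not from any package), and it assumes that no value occurs in more than one of the three columns.

```r
mydata <- data.frame(type = c("tcp", "udp", "tcp", "tcp", "tcp"),
                     service = c("ftp", "other", "private", "http", "http"),
                     flag = c("SF", "SF", "S0", "SF", "SF"))

# Hypothetical helper: one 0/1 column per distinct value of a single column.
one_hot <- function(x) {
  x <- as.character(x)                   # works for factor and character input
  values <- sort(unique(x))
  out <- outer(x, values, "==") * 1L     # rows x values matrix of 0/1
  colnames(out) <- values
  out
}

# Row ID followed by the indicator columns of each variable.
expected <- cbind(ID = seq_len(nrow(mydata)),
                  one_hot(mydata$type),
                  one_hot(mydata$service),
                  one_hot(mydata$flag))
print(expected)
```

For example, row 3 (tcp, private, S0) should have a 1 in the columns tcp, private and S0 and a 0 everywhere else.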
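library(dplyr)
library(tidyr)

mydata %>%
  select(type, service, flag) %>%
  mutate(ID = 1:nrow(.)) %>%                          # row identifier
  gather(type, version, c(type, service, flag)) %>%   # long format: key column `type`, value column `version`
  mutate(present = 1) %>%                             # indicator value to spread
  select(-type) %>%                                   # drop the key column (original variable name)
  spread(version, present, fill = 0)                  # one 0/1 column per distinct value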
The result seems to be correct, but the following warning is issued:
"attributes are not identical across measure variables; they will be dropped"
My second attempt is written in a very poor coding style, but it works properly:
# `present` is a helper column holding the indicator value, so that the
# original `type` column is not overwritten before it is spread.
mydata %>%
  select(type, service, flag) %>%
  mutate(present = 1, ID = 1:nrow(.)) %>%
  distinct(ID, .keep_all = TRUE) %>%
  spread(type, present, fill = 0) %>%
  mutate(present = 1) %>%
  distinct(ID, .keep_all = TRUE) %>%
  spread(service, present, fill = 0) %>%
  mutate(present = 1) %>%
  distinct(ID, .keep_all = TRUE) %>%
  spread(flag, present, fill = 0) %>%
  arrange(ID)
I would really prefer the first solution, but I am not sure what happens internally. As my real data set is very large, I cannot inspect every entry to confirm that the conversion is correct.

So my question is: does anybody know why the warning is issued and how to resolve it? I would prefer a solution using the tidyr package, but other proposals are very welcome too!
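Since I cannot check every row by eye, the kind of automated check I have in mind looks roughly like the sketch below. The `wide` table is a hand-built stand-in for the pipeline output on the example data, and the check assumes that no value occurs in more than one of the original columns.

```r
mydata <- data.frame(type = c("tcp", "udp", "tcp", "tcp", "tcp"),
                     service = c("ftp", "other", "private", "http", "http"),
                     flag = c("SF", "SF", "S0", "SF", "SF"))

# Stand-in for the pipeline output (assumption: written out by hand here).
wide <- data.frame(ID = 1:5,
                   tcp = c(1, 0, 1, 1, 1), udp = c(0, 1, 0, 0, 0),
                   ftp = c(1, 0, 0, 0, 0), http = c(0, 0, 0, 1, 1),
                   other = c(0, 1, 0, 0, 0), private = c(0, 0, 1, 0, 0),
                   S0 = c(0, 0, 1, 0, 0), SF = c(1, 1, 0, 1, 1))

indicators <- as.matrix(wide[, setdiff(names(wide), "ID")])
raw <- as.matrix(mydata)   # character matrix of the original values

# 1. Only 0 and 1 may occur.
only_binary <- all(indicators %in% c(0, 1))

# 2. Every row needs exactly one 1 per original variable.
rows_ok <- all(rowSums(indicators) == ncol(mydata))

# 3. Each column total must equal how often that value occurs in the raw data.
raw_counts <- vapply(colnames(indicators),
                     function(v) sum(raw == v),
                     integer(1))
cols_ok <- all(colSums(indicators) == raw_counts)

print(c(only_binary = only_binary, rows_ok = rows_ok, cols_ok = cols_ok))
```

If all three checks are TRUE, the binary table at least agrees with the raw data in its totals, although that still does not tell me why the warning appears.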