I have a dataset where all my data is categorical and I would like to use one hot encoding for further analysis.

Main issues I would like to resolve:

  • Some cells contain several text values in a single cell (an example follows).
  • Some numerical values need to be converted to factors for further processing.

The data has three columns: Age, Info, and Target.

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")

I want to one-hot encode all of the variables shown above so that the result looks like the following:

    Age_99 Age_10 Age_40 Age_15 good bad sad nice happy joy null okay fun wild go Boy Girl
         1      0      0      0    1   1   1    0     0   0    0    0   0    0  0   1    0
         0      1      0      0    0   0   0    1     1   1    0    0   0    0  0   0    1

Some of the questions on SO I have checked are this and this.

Boro Dega
  • How did you get this data in this form to begin with? Can you `dput` these few lines for us? – A5C1D2H2I1M1N2O1R2T1 Mar 18 '16 at 15:26
  • Possible duplicate of [R DataFrame - One Hot Encoding of column containing multiple terms](https://stackoverflow.com/questions/39778387/r-dataframe-one-hot-encoding-of-column-containing-multiple-terms) – Roman Oct 24 '18 at 12:04

2 Answers

I would suppose that the following should work:

library(data.table)                              ## as.data.table, melt, dcast
library(splitstackshape)
library(magrittr)

suppressWarnings({                               ## Just to silence melt
  mydf %>%                                       ## The dataset
    as.data.table(keep.rownames = TRUE) %>%      ## Convert to data.table
    .[, Info := gsub("c\\(|\"", "", Info)] %>%   ## Strip out c( and quotes
    cSplit("Info", ",") %>%                      ## Split the "Info" column
    melt(id.vars = "rn") %>%                     ## Melt everything except rn
    dcast(rn ~ value, fun.aggregate = length)    ## Go wide
})
#    rn 10 15 40 99 Boy Girl NULL bad fun go good happy joy nice okay sad wild NA
# 1:  1  0  0  0  1   1    0    0   1   0  0    1     0   0    0    0   1    0  2
# 2:  2  1  0  0  0   0    1    0   0   0  0    0     1   1    1    0   0    0  2
# 3:  3  0  0  1  0   1    0    1   0   0  0    0     0   0    0    0   0    0  4
# 4:  4  0  1  0  0   1    0    0   0   1  1    0     0   0    1    1   0    1  0

Here's the sample data I used:

mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"", 
    "c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
    ), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age", 
    "Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
A5C1D2H2I1M1N2O1R2T1
  • @A Handcart And Mohair Thanks for your answer, but I have a few questions. Where does the variable "rn" come from? Furthermore, you only split the variable "Info", but "Age" and "Target" also get split. Is this the norm for cSplit, or can you select which variables to split and which not to split? Thanks – Boro Dega Apr 01 '16 at 15:47

You can use the grepl function to scan each string for whatever you are looking for, and use ifelse to fill the column appropriately. Something like:

 # This creates a new column labeled 'good': 1 if the Info string contains "good", 0 if not
 data$good <- ifelse(grepl("good", data$Info), 1, 0)
 # and do this for each variable of interest 
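
Writing one line per term gets tedious, and a plain substring match such as `grepl("go", ...)` would also flag rows that contain "good". Here is a minimal sketch that loops over every distinct term and matches whole terms only, assuming the question's `mydf` and its `Info` column. The names `terms_per_row`, `all_terms`, and `info_onehot` are made up for illustration:

```r
# The question's sample data, rebuilt so the sketch is self-contained
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Info = c('c("good", "bad", "sad"', 'c("nice", "happy", "joy"', "NULL",
           'c("okay", "nice", "fun", "wild", "go"'),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Strip the leading c( and the quotes, then split each cell on commas
cleaned <- gsub('c\\(|"', "", mydf$Info)
terms_per_row <- strsplit(cleaned, ",\\s*")

# Every distinct term that appears anywhere in the column
all_terms <- sort(unique(unlist(terms_per_row)))

# One indicator column per term: 1 if the row lists that exact term, else 0
info_onehot <- sapply(all_terms, function(term) {
  as.integer(vapply(terms_per_row, function(x) term %in% x, logical(1)))
})
info_onehot
```

Because `%in%` compares whole terms, the row containing "good" does not get a 1 in the "go" column, which a bare `grepl("go", ...)` would produce.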

At the end you can remove the Info column if you'd like. This way you don't have to make any new data tables.

 data$Info <- NULL

Note that you should change 'data' to whatever your data set is actually called. As for the problem with age, there is no need to convert it to a factor; just use:

data$age99 = ifelse(data$Age == 99, 1,0) # and so forth for the other ages
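
The same comparison can be applied to every distinct age (and to Target) without writing one line per value. This is a sketch assuming the question's `Age` and `Target` columns. The `one_hot` helper is hypothetical, not part of any package:

```r
# Only the two columns needed here, copied from the question's sample data
mydf <- data.frame(
  Age = c(99L, 10L, 40L, 15L),
  Target = c("Boy", "Girl", "Boy", "Boy"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one 0/1 column per distinct value of a vector
one_hot <- function(x, prefix) {
  values <- unique(x)
  # vapply returns a matrix with one column per distinct value
  out <- vapply(values, function(v) as.integer(x == v), integer(length(x)))
  colnames(out) <- paste0(prefix, values)
  out
}

encoded <- cbind(one_hot(mydf$Age, "Age_"), one_hot(mydf$Target, "Target_"))
encoded
```

Each row then has exactly one 1 among the `Age_` columns and one among the `Target_` columns.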

cgage1