I need to create a function that takes two integers x and N, where N > x, and returns a vector of length N that is all zeros except for component x, which is 1.

I managed to do it in the following way,

Function=function(x,N){
  vec=rep(0,N)
  r=as.integer(x)
  vec[r]=1
  return(vec)
}

but it is incredibly slow when I need to iterate the process and apply it to a large number of realizations. On the other hand, a friend of mine is able to do the same thing with a single Python function (I think "OneHotEncoder"), and it's super fast.

I was wondering if there are functions in R that are suited for this purpose.

zx8754
3sm1r
  • The problem may lie less in the function itself than in how you apply it to larger cases. – Axeman May 15 '18 at 13:12
  • without defining a custom function, you can do `library(magrittr); integer(N) %>% \`[<-\`(x, 1L)` – IceCreamToucan May 15 '18 at 13:31
  • Rather than writing your own one-hot encoder, you could use one of the already-available optimized methods, like `model.matrix` (or `Matrix::sparse.model.matrix`, if your data is really large). If you search the R tag for "one hot encoding" or "dummy variables", you will find many examples. – Gregor Thomas May 15 '18 at 13:43
  • Related posts: https://stackoverflow.com/questions/5048638/automatically-expanding-an-r-factor-into-a-collection-of-1-0-indicator-variables and https://stackoverflow.com/questions/11952706/generate-a-dummy-variable – zx8754 May 15 '18 at 13:47
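
As a rough illustration of the `model.matrix` suggestion in the comments above, here is a minimal sketch. The data frame `df` and its column `piece` are made up for the example; the only assumption about the real data is that the column to encode is stored as a factor.

```r
# Hypothetical data: one categorical column stored as a factor.
df <- data.frame(
  piece = factor(c("a", "c", "b", "c"), levels = c("a", "b", "c"))
)

# "+ 0" removes the intercept, so each level of the factor
# gets its own 0/1 indicator column.
mm <- model.matrix(~ piece + 0, data = df)

# One row per observation, one column per factor level.
dim(mm)
colnames(mm)
```

With a single factor and no intercept, every row contains exactly one 1. With several factors in the same formula, only the first one is expanded to all of its levels by default, so the result should be checked before relying on it.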

3 Answers


Along the lines of what @Axeman said, you should consider whether you can compute the one-hot encoding in a vectorized way, for example like this:

set.seed(1234)
x = sample.int(5, size=10, replace=TRUE)
x
#  [1] 1 4 4 4 5 4 1 2 4 3

nC = max(x) #could be also larger (user-defined)
nR = length(x)
matrix(`[<-`(integer(nR * nC),(seq.int(nR) - 1) * nC + x, 1),
       nR, nC, byrow=TRUE)
#       [,1] [,2] [,3] [,4] [,5]
#  [1,]    1    0    0    0    0
#  [2,]    0    0    0    1    0
#  [3,]    0    0    0    1    0
#  [4,]    0    0    0    1    0
#  [5,]    0    0    0    0    1
#  [6,]    0    0    0    1    0
#  [7,]    1    0    0    0    0
#  [8,]    0    1    0    0    0
#  [9,]    0    0    0    1    0
# [10,]    0    0    1    0    0
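
The expression `(seq.int(nR) - 1) * nC + x` computes, for each row, the position of its 1 in a vector that is later laid out row by row. A possibly more readable variant of the same idea, sketched here as an alternative rather than as the method used above, indexes the matrix directly with (row, column) pairs. The names `n_col` and `m` are chosen for this example only.

```r
x <- c(1L, 4L, 4L, 2L)   # categories to encode, assumed to lie in 1..n_col
n_col <- 5L              # number of categories (assumed known in advance)

# Start from an all-zero integer matrix with one row per element of x.
m <- matrix(0L, nrow = length(x), ncol = n_col)

# A two-column index matrix selects the cells (i, x[i]) in one step.
m[cbind(seq_along(x), x)] <- 1L
m
```

Both variants avoid looping over the elements of `x`, which is where the speed-up comes from.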

Comparing the model.matrix approach with the approach given above:

#longer input vector
x = sample.int(5, size=1e4, replace=TRUE)

oneHotMtx = function(x) {
  nC = max(x) #could be also larger (user-defined)
  nR = length(x)
  matrix(`[<-`(integer(nR * nC),(seq.int(nR) - 1) * nC + x, 1),
         nR, nC, byrow=TRUE)
}

oneHotMdl = function(x) {
  xf = factor(x)
  model.matrix(~xf+0)
}

oneHotMdl2=function(x) {
  #version without factor conversion (expects x to already be a factor)
  model.matrix(~x+0)
}

xf = factor(x)
library(microbenchmark)
microbenchmark(oneHotMtx(x),
               oneHotMdl(x),
               oneHotMdl2(xf), times=1e3)

#Unit: microseconds
#          expr      min       lq      mean    median       uq        max neval cld
#  oneHotMtx(x)  386.621  412.510  678.2977  416.4625  435.382   5394.265  1000 a  
#  oneHotMdl(x) 7363.481 7528.230 8823.8435 7629.8850 7851.019 261808.302  1000   c
#oneHotMdl2(xf) 4253.366 4377.784 5059.0979 4471.5315 4638.637 257106.400  1000  b 
cryo111
  • But why use this method to one-hot encode one variable at a time instead of using `model.matrix` to do them all at once? No doubt this is an improvement on OP's code, but why stop here? – Gregor Thomas May 15 '18 at 15:11
  • Nice, though I don't think it's really fair to include the factor conversion in the model matrix runtime; the more common use case would be starting with factors. – Gregor Thomas May 15 '18 at 15:37
  • Thank you @Gregor, I think that you got precisely where the problem was. Until now I was using a for loop to apply the function to every component of large vectors. "model.matrix" seems to be what I was looking for. I'll study how it works right now. – 3sm1r May 15 '18 at 15:37
  • @Gregor Good point - added another version without factor conversion. – cryo111 May 15 '18 at 15:43
  • @3sm1r just make sure any columns you want to one-hot encode are factors, then `mm = model.matrix(response ~ . + 0, data = your_data_frame)`. – Gregor Thomas May 15 '18 at 15:44
  • @cryo111 I had to think about it, but now I get your point: you are saying that encoding with a vector operation is way faster than solving the problem considering the components one by one. I'll surely give it a try. Thanks – 3sm1r May 15 '18 at 20:15
  • EDIT @cryo111 The speed has increased by orders of magnitude with your method. Thanks again. – 3sm1r May 15 '18 at 21:36
  • @3sm1r Gregor's solution comes with some advantages, such as that it can be easily applied to multiple factors at once (see his comment above). However, `model.matrix` also seems to have some overhead (maybe `NA` handling or similar...) that makes it a bit slower than the "pure" matrix-based approach. What works better really depends on your data... – cryo111 May 16 '18 at 15:50
  • @cryo111 my data are quite simple: I have vectors with 64 components, each of which can take values from 0 to 12 (a chessboard). I want to transform these vectors into new vectors with 13*64=832 components, each with 0 or 1 as possible values. Both your methods seem to work well for this purpose. – 3sm1r May 17 '18 at 09:02
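
For the chessboard case described in the last comment, the encoding could look roughly like the sketch below. The `board` vector is random stand-in data, and the layout of the 832-component result (13 consecutive entries per square) is an assumption, since the thread does not specify the ordering.

```r
set.seed(1)
# Hypothetical board: 64 squares, each holding a value from 0 to 12.
board <- sample(0:12, size = 64, replace = TRUE)
n_values <- 13L

# One row per square, one column per possible value.
m <- matrix(0L, nrow = length(board), ncol = n_values)

# Values start at 0 but R indices start at 1, hence the shift by one.
m[cbind(seq_along(board), board + 1L)] <- 1L

# Flatten row by row: entries 1..13 belong to square 1, 14..26 to square 2, ...
encoded <- as.vector(t(m))
length(encoded)
```

The shift by one is the part that is easy to miss: a value of 0 would otherwise be silently dropped, because assigning to index 0 does nothing in R.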

Try

one_hot_encoder <- function(x, N) {
  vec <- integer(N)
  vec[x] <- 1L
  return(vec)
}
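
For a single position, base R's `tabulate()` happens to produce the same kind of vector, because it counts how often each integer from 1 to `nbins` occurs. This is offered as an alternative sketch, not as something proposed in this answer; the values of `x` and `N` are examples.

```r
x <- 3L   # position of the 1
N <- 5L   # length of the result

# Counting the single value x in bins 1..N yields a one-hot vector.
vec <- tabulate(x, nbins = N)
vec
# [1] 0 0 1 0 0
```

Note that this only works as a one-hot encoder when `x` is a single value; for a longer `x`, `tabulate()` returns counts rather than indicators.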
symbolrush

A slightly more detailed variant of @cryo111's answer:

one_hot_vec <- function(x) {
    nc <- max(x)
    nr <- length(x)
    m <- integer(nr * nc)
    i <- (seq_len(nr) - 1) * nc + x
    m[i] <- 1L
    matrix(m, nrow = nr, ncol = nc, byrow = TRUE)
}
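
If the number of categories is known in advance and may be larger than `max(x)`, the function could take it as an argument. The following is a sketch under that assumption; the name `one_hot_mat`, the `nc` argument, and the input check are additions for illustration and not part of the answer.

```r
one_hot_mat <- function(x, nc = max(x)) {
  x <- as.integer(x)
  # Guard against values that would fall outside the matrix.
  stopifnot(!anyNA(x), all(x >= 1L), all(x <= nc))
  nr <- length(x)
  m <- integer(nr * nc)
  # Position of the 1 for row i in the row-major vector: (i - 1) * nc + x[i].
  m[(seq_len(nr) - 1L) * nc + x] <- 1L
  matrix(m, nrow = nr, ncol = nc, byrow = TRUE)
}

# Three values encoded into six columns, even though max(x) is 4.
res <- one_hot_mat(c(1, 4, 2), nc = 6)
dim(res)
```

Without the check, a value larger than `nc` would not raise an error; it would silently set a 1 in a later row, or extend the vector past the matrix size.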
Artem Klevtsov