
I have the following problem:

My data frame looks like the following, although the real one is much larger (20 GB):

Letters <- c("A","B","C")
Numbers <- c(1,0,1)
Numbers <- as.integer(Numbers)

Data.Frame <- data.frame(Letters,Numbers)

Now I want to create a dummy variable for each value of Letters, and wrote the following for-loop:

for (level in unique(Data.Frame$Letters)) {
  Data.Frame[paste("", level, sep = "")] <- ifelse(Data.Frame$Letters == level, 1, 0)
}

Because my data frame is so large, this takes a very long time to execute. Another possible solution I tried was:

factors <- model.matrix(~Letters-1, data=Data.Frame)
cbind(Data.Frame, factors)

The result is the same, but on the larger data frame it takes even longer.

Are there any alternatives that produce the same result and are computationally faster?

Thank you very much in advance!

Mucteam
  • What are the dimensions of your data? How many unique values are in Letters? – minem Apr 23 '18 at 11:15
  • As your data is large, a sparse solution might work. flodel gives a fast way to generate one [here](https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r?answertab=votes#tab-top) – user20650 Apr 23 '18 at 11:16
  • Regarding the dimensions: I have about 18 million rows and about 13,000 unique values in Letters. – Mucteam Apr 23 '18 at 11:18
  • So creating dummy variables will produce an 18M x 13,000 structure. I'd think you must use a sparse matrix. – user20650 Apr 23 '18 at 11:19
  • @user20650. Thanks for the suggestions! I will check it out and see if they work for my problem. – Mucteam Apr 23 '18 at 11:21
  • 1) `paste("", level, sep = "")` does nothing and takes time to do it. 2) `Data.Frame[level] <- (Data.Frame$Letters == level) + 0L` is faster than `ifelse`. – Rui Barradas Apr 23 '18 at 11:24
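
To make the last two points from the comments concrete, here is a minimal base R sketch on toy data in the shape of the question's example. It checks that the logical-plus-`0L` form gives the same values as `ifelse`, and does the back-of-envelope arithmetic for a dense 18M x 13,000 integer structure. The variable names are illustrative, and the estimate assumes 4 bytes per integer (R's integer storage size) with no overhead.

```r
# Toy data in the shape of the question's example (illustrative only)
Data.Frame <- data.frame(Letters = c("A", "B", "C", "C"),
                         Numbers = c(1L, 0L, 1L, 2L),
                         stringsAsFactors = FALSE)

# ifelse() version from the question vs. logical + 0L from the comment
via_ifelse  <- ifelse(Data.Frame$Letters == "C", 1, 0)
via_logical <- (Data.Frame$Letters == "C") + 0L

# Both give the same 0/1 values; the second is stored as integer
same_values <- all(via_ifelse == via_logical)

# Rough size of a dense integer dummy structure at the stated dimensions:
# 18 million rows x 13,000 columns x 4 bytes per integer, in GB
dense_gb <- 18e6 * 13000 * 4 / 1e9
```

Under those assumptions the dense result would need roughly 936 GB, which is why the comments point toward a sparse representation.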

3 Answers

1

If you have enough RAM you could try this:

n <- 18e6
set.seed(31)
d <- data.frame(Letters = as.factor(sample.int(1.3e4, n, replace = TRUE)),
                Numbers = sample.int(30, n, replace = TRUE))
library(data.table)
dt <- as.data.table(d)
x2 <- as.integer(dt$Letters)
ilist <- unique(x2)[1:20] # test with only the first 20 columns
for (i in ilist) {
  set(dt, j = as.character(i), value = (x2 == i) + 0L)
}

Otherwise you should use a sparse matrix, as suggested by other users:

library(Matrix)
dd <- sparse.model.matrix(~ Letters - 1, data = d)
dd[1:5, 1:5]
# 5 x 5 sparse Matrix of class "dgCMatrix"
#   Letters1 Letters2 Letters3 Letters4 Letters5
# 1        .        .        .        .        .
# 2        .        .        .        .        .
# 3        .        .        .        .        .
# 4        .        .        .        .        .
# 5        .        .        .        .        .
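
A minimal sketch of the same sparse idea on toy data, assuming the Matrix package is installed. It builds the dummy columns directly with `sparseMatrix()`, storing exactly one nonzero per row, and compares the values with the dense `model.matrix()` result. The toy vector and the names `dummies` and `dense` are illustrative, not taken from the data above.

```r
library(Matrix)

# Toy factor in the shape of the question's Letters column (illustrative only)
Letters <- factor(c("A", "B", "C", "C"))

# One nonzero per row: row i gets a 1 in the column of its factor level
dummies <- sparseMatrix(i = seq_along(Letters),
                        j = as.integer(Letters),
                        x = 1,
                        dims = c(length(Letters), nlevels(Letters)),
                        dimnames = list(NULL, levels(Letters)))

# Compare with the dense model.matrix() result on the same toy data
dense <- model.matrix(~ Letters - 1)
identical_values <- all(as.matrix(dummies) == dense)
```

Only the nonzero entries are stored, so memory grows with the number of rows rather than with rows times levels.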
minem
1

You could use dcast.data.table from the data.table package, like this (Letters and Numbers are the vectors from the question):

library(data.table)
dt <- data.table(Letters, Numbers)
dcast.data.table(dt, Letters + Numbers ~ Letters, fun.aggregate = length)

   Letters Numbers A B C
1:       A       1 1 0 0
2:       B       0 0 1 0
3:       C       1 0 0 1
Pierre Lapointe
-1

This may be faster with data.table. What about:

Letters <- c("A","B","C","C")
Numbers <- c(1,0,1,2)
Numbers <- as.integer(Numbers)
Data.Frame <- data.frame(Letters,Numbers)
library(data.table)
DT <- as.data.table(Data.Frame)
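lvls <- unique(DT$Letters)  # distinct values; avoids overwriting the Letters vector
for (l in lvls) {
  # add one 0/1 integer column per value, by reference
  DT[, (l) := as.integer(Letters == l)]
}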


> DT
   Letters Numbers A B C
1:       A       1 1 0 0
2:       B       0 0 1 0
3:       C       1 0 0 1
4:       C       2 0 0 1
Stéphane Laurent