Faster Alternatives for for-loop

Question

I have the following problem:

My data frame looks like the following, although the real one is much bigger (20 GB):

Letters <- c("A","B","C")
Numbers <- c(1,0,1)
Numbers <- as.integer(Numbers)

Data.Frame <- data.frame(Letters,Numbers)

Now I want to create dummy variables for Letters, so I wrote the following for-loop:

for (level in unique(Data.Frame$Letters)) {
  Data.Frame[paste("", level, sep = "")] <- ifelse(Data.Frame$Letters == level, 1, 0)
}

Because my data frame is so large, this takes a very long time to execute. Another possible solution I tried was:

factors <- model.matrix(~Letters-1, data=Data.Frame)
cbind(Data.Frame, factors)

The result is the same, but on a larger data frame it takes even longer.

Are there alternatives that give the same result but are computationally faster?

Thank you very much in advance!

what are dimensions of your data? how many unique values are in Letters? — minem, Apr 23 '18 at 11:15
as your data is large , maybe a sparse solution might work. flodel gives a fast way to generate [here](https://stackoverflow.com/questions/23035982/directly-creating-dummy-variable-set-in-a-sparse-matrix-in-r?answertab=votes#tab-top) — user20650, Apr 23 '18 at 11:16
Regarding the dimensions: I have about 18 Million rows and about 13.000 unique values in Letters. — Mucteam, Apr 23 '18 at 11:18
so creating dummy variables will create a 18M x 13000 structure - I'd think you must use a sparse matrix. — user20650, Apr 23 '18 at 11:19
@user20650. Thanks for the suggestions! I will check it out and see if they work for my problem. — Mucteam, Apr 23 '18 at 11:21
1) `paste("", level, sep = "")` does nothing and takes time to do it. 2) `Data.Frame[level] <- (Data.Frame$Letters == level) + 0L` is faster than `ifelse`. — Rui Barradas, Apr 23 '18 at 11:24
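The last comment's suggestion can be sketched on the question's toy data as follows. This is a minimal illustration, not a benchmark: it drops the no-op `paste()` call and replaces `ifelse` with a logical comparison plus `0L`. One difference worth noting is that this yields integer columns, whereas `ifelse(..., 1, 0)` yields doubles.

```r
# Toy data from the question
Letters <- c("A", "B", "C")
Numbers <- as.integer(c(1, 0, 1))
Data.Frame <- data.frame(Letters, Numbers)

# One 0/1 integer column per distinct letter.
# (x == level) is logical; adding 0L converts TRUE/FALSE to 1L/0L.
for (level in unique(Data.Frame$Letters)) {
  Data.Frame[[level]] <- (Data.Frame$Letters == level) + 0L
}

print(Data.Frame)
```

With 13,000 distinct values this still creates 13,000 dense columns, so the memory concern raised in the comments applies regardless of how fast each column is built.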

score 1 · Answer 1 · answered Apr 23 '18 at 12:10

If you have enough RAM you could try this:

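library(data.table)

# Simulate data of roughly the size described in the question
n <- 18e6
set.seed(31)
d <- data.frame(Letters = as.factor(sample.int(1.3e4, n, replace = TRUE)),
                Numbers = sample.int(30, n, replace = TRUE))
dt <- as.data.table(d)
x2 <- as.integer(dt$Letters)  # integer codes of the factor levels
ilist <- unique(x2)[1:20]     # only 20 columns, for testing
for (i in ilist) {
  # add one 0/1 integer column per level, by reference
  set(dt, j = as.character(i), value = (x2 == i) + 0L)
}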

Otherwise, you should use a sparse matrix, as other users suggested:

library(Matrix)
dd <- sparse.model.matrix(~ Letters - 1, data = d)
dd[1:5, 1:5]
# 5 x 5 sparse Matrix of class "dgCMatrix"
#   Letters1 Letters2 Letters3 Letters4 Letters5
# 1        .        .        .        .        .
# 2        .        .        .        .        .
# 3        .        .        .        .        .
# 4        .        .        .        .        .
# 5        .        .        .        .        .
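Because every row has exactly one nonzero entry, the same kind of sparse indicator matrix can also be built directly from the factor codes, without going through a model formula. The following is a minimal sketch on a four-row toy factor (the data and the name `dummies` are illustrative assumptions, and this is not necessarily the approach taken in the answer linked in the comments):

```r
library(Matrix)

# Toy factor standing in for the Letters column
Letters <- factor(c("A", "B", "C", "C"))

# Row k gets a single 1 in the column of its factor level.
dummies <- sparseMatrix(
  i = seq_along(Letters),        # row index: 1..n
  j = as.integer(Letters),       # column index: integer code of the level
  x = 1,                         # value stored at each (i, j)
  dims = c(length(Letters), nlevels(Letters)),
  dimnames = list(NULL, levels(Letters))
)

print(dummies)
```

Only the nonzero entries are stored, so memory use grows with the number of rows rather than with rows times levels.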

score 1 · Accepted Answer · answered Apr 23 '18 at 12:21

You could use dcast.data.table from the data.table package like this:

library(data.table)
dt <- data.table(Letters, Numbers)
dcast.data.table(dt, Letters + Numbers ~ Letters, fun.aggregate = length)

   Letters Numbers A B C
1:       A       1 1 0 0
2:       B       0 0 1 0
3:       C       1 0 0 1

answered Apr 23 '18 at 12:21

Pierre Lapointe


Why the downvote? This gives the result OP wants and is very fast on large data sets. – Pierre Lapointe Apr 23 '18 at 12:26
Someone downvotes every answer in this thread, without leaving any explanation. – Stéphane Laurent Apr 23 '18 at 12:32
@PierreLapointe works very well! Thank you! – Mucteam Apr 23 '18 at 13:43
@Mucteam I have timed the several answers and this one is the fastest, followed by the OP's with my suggestion, which I found surprising. (4 times faster.) – Rui Barradas Apr 23 '18 at 16:45

Stéphane Laurent · Answer 3 · 2018-04-23T11:41:23.647

-1

This may be faster with data.table. What about:

Letters <- c("A","B","C","C")
Numbers <- c(1,0,1,2)
Numbers <- as.integer(Numbers)
Data.Frame <- data.frame(Letters,Numbers)
library(data.table)
DT <- as.data.table(Data.Frame)
Letters <- unique(DT$Letters)
for(l in Letters){
  DT[, (l):=as.integer(Letters==l)]
}


> DT
   Letters Numbers A B C
1:       A       1 1 0 0
2:       B       0 0 1 0
3:       C       1 0 0 1
4:       C       2 0 0 1

edited Apr 23 '18 at 11:41

answered Apr 23 '18 at 11:26

Stéphane Laurent


Can you see my comment to the question? `==` returns `FALSE/TRUE` plus `0L` gives what the OP want and is always faster. – Rui Barradas Apr 23 '18 at 11:32
if there are 18M rows and 13k unique values , won't this require 18e6 * 13000 * 8 / 2^30 GB storage? – user20650 Apr 23 '18 at 11:32
@RuiBarradas Please see my edit. I've replaced with `as.integer`. Better? – Stéphane Laurent Apr 23 '18 at 11:41
Why do I get some downvotes? Could you please leave an explanation? If I misunderstood something, I could delete my answer. – Stéphane Laurent Apr 23 '18 at 12:31
Apparently someone is downvoting all answers. – Rui Barradas Apr 23 '18 at 13:29
@RuiBarradas Yes, that's what I've observed too... I will upvote all answers to counterbalance ;-) – Stéphane Laurent Apr 23 '18 at 13:35
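The storage estimate raised in the comments above can be checked with a few lines of arithmetic. This is a rough sketch using the dimensions stated in the question's comments (about 18 million rows and 13,000 distinct values); it ignores object overhead and dimnames. The sparse figure assumes a compressed column layout like `dgCMatrix`, which keeps one double value and one integer row index per nonzero entry plus one integer pointer per column.

```r
n_rows   <- 18e6    # rows, as stated in the comments
n_levels <- 13000   # distinct values in Letters
gib      <- 2^30    # bytes per GiB

# Dense storage: one cell per row and level
dense_double_gib <- n_rows * n_levels * 8 / gib  # doubles: roughly 1743 GiB
dense_int_gib    <- n_rows * n_levels * 4 / gib  # integers: roughly 872 GiB

# Sparse storage: exactly one nonzero per row
# (8-byte value + 4-byte row index per nonzero, 4-byte pointer per column + 1)
sparse_gib <- (n_rows * (8 + 4) + (n_levels + 1) * 4) / gib  # roughly 0.2 GiB

print(c(dense_double = dense_double_gib,
        dense_int    = dense_int_gib,
        sparse       = sparse_gib))
```

Even with integer columns, the dense layout is far beyond typical RAM, which is why the sparse representation is the practical option at this size.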
