
I am trying to increase the computational efficiency of the process below. I have created a toy example with data for review. The first method runs in half the time of the second method.

How can I improve upon the run-time of the first method?

library(sqldf)
id = c(1,1,1,1,2,2,2,5,5,5,5,5,5)
qn = c(0,0,1,1,0,1,0,0,0,1,0,1,0)
d = data.frame(id = id, qn = qn)

un = unique(d$id)
holder = matrix(0,length(un), 1)
counter = 0

x = proc.time()

for (i in un)
{
  z = head(which(d[d$id == i,]$qn==1),1)
  counter = counter + 1
  holder[counter,] = z
}

proc.time() - x
f = sqldf("select id, count(qn) from d group by id", drv = 'SQLite')
f = cbind(f,holder)
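The loop above re-filters the full data frame on every iteration (`d[d$id == i, ]` scans all rows each pass). A one-pass sketch using base R's `split()`, assuming every id has at least one 1, as in the toy data:

```r
# Partition qn by id in a single pass, then find the first 1 in each group.
id <- c(1, 1, 1, 1, 2, 2, 2, 5, 5, 5, 5, 5, 5)
qn <- c(0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0)
d  <- data.frame(id = id, qn = qn)

grp    <- split(d$qn, d$id)                        # one scan of d
holder <- vapply(grp, function(x) which(x == 1)[1], integer(1))
holder
#  1  2  5
#  3  2  3
```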
#################################
un = unique(d$id)
holder = matrix(0,length(un), 1)
counter = 0

x = proc.time()

for (i in seq_along(un))
{
  y = paste("select * from d where id = ", un[i])
  y = sqldf(y, drv = 'SQLite')
  y = min(which(y$qn==1))
  counter = counter + 1
  holder[counter,] = y
}

proc.time() - x
f = sqldf("select id, count(qn) from d group by id", drv = 'SQLite')
f = cbind(f,holder)

I am trying to calculate for each id the first instance of a 1.

Expected output:

# id first
# 1:  1     3
# 2:  2     2
# 3:  5     3
zx8754

3 Answers


You can do this without sqldf using dplyr:

library(dplyr)
d %>% 
    group_by(id) %>% 
    summarize(first=first(which(qn==1)))
MrFlick
  • Thank you! That works. Speed is now down to 0.03 seconds from 0.05 seconds! What does the %>% mean? – quant actuary Oct 20 '15 at 05:53
  • The `%>%` represents the 'pipe' operator, and what it does is to pass data onto the next function, see http://stackoverflow.com/questions/24845028/understanding-operator/24845091#24845091 and http://stackoverflow.com/questions/24536154/what-does-mean-in-r – Alex Oct 28 '15 at 01:17
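As the comments explain, `%>%` pipes its left-hand side into the next call as the first argument; a minimal sketch of the equivalence (using a small made-up data frame):

```r
library(dplyr)

d <- data.frame(id = c(1, 1, 1, 2), qn = c(0, 0, 1, 1))

# The piped form...
piped <- d %>% group_by(id) %>% summarize(first = first(which(qn == 1)))

# ...is the same as the explicitly nested form:
nested <- summarize(group_by(d, id), first = first(which(qn == 1)))

identical(piped, nested)  # TRUE
```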

We can also use data.table:

library(data.table)
setDT(d)[, list(first= which.max(qn)) , id]
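Note that `which.max(qn)` returns the position of the first maximum, so it gives the first 1 when one exists, but for a group with no 1 it would silently return 1 rather than NA. A hedged variant (my own addition, not from the answer) that yields NA for such groups:

```r
library(data.table)

d2 <- data.table(id = c(1, 1, 2, 2), qn = c(0, 1, 0, 0))

# which(qn == 1)[1] is NA_integer_ when a group contains no 1,
# whereas which.max(qn) would report position 1 for the all-zero group.
res <- d2[, .(first = which(qn == 1)[1]), by = id]
res
```

For this toy input the result is `first = 2` for id 1 and `NA` for id 2.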
akrun

1) Using sqldf within lapply:

do.call(rbind,
        lapply(split(d, d$id), function(i)
          sqldf("SELECT id, min(rowid) AS first
                 FROM (SELECT rowid, *
                       FROM i) AS x
                 WHERE qn = 1"))
        )

##   id first
## 1  1     3
## 2  2     2
## 5  5     3

2) Or, for a pure SQL solution, subtract the rowid of the first row in each group from the first rowid having qn = 1 in each group and add 1:

sqldf("select id, min_row1 - min_row + 1 first 
       from (select id, min(rowid) min_row 
             from d 
             group by id)
       join (select id, min(rowid) min_row1 
             from d where qn = 1 
             group by id) using (id)")


##   id first
## 1  1     3
## 2  2     2
## 3  5     3

3) Or, for an alternative pure SQL solution, create a sequence seq within each id in the inner select and then pick out the first row within each id group having qn = 1:

sqldf("select id, min(seq) first 
       from (select x.id, x.qn, count() seq 
             from d x 
             join d y on x.rowid >= y.rowid and x.id = y.id 
             group by x.rowid)
       where qn = 1
       group by id")

##   id first
## 1  1     3
## 2  2     2
## 3  5     3
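The self-join in the inner select numbers rows within each id; the same per-group sequence idea can be sketched in base R with `ave()`:

```r
id <- c(1, 1, 1, 1, 2, 2, 2, 5, 5, 5, 5, 5, 5)
qn <- c(0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0)
d  <- data.frame(id = id, qn = qn)

# Within-group row number, analogous to the count() self-join above
d$seq <- ave(seq_along(d$id), d$id, FUN = seq_along)

# Keep rows with qn = 1, then take the smallest sequence number per id
res <- with(d[d$qn == 1, ], tapply(seq, id, min))
res
#  1  2  5
#  3  2  3
```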
G. Grothendieck
  • @G.Grothendieck is there a way to get `row_number over(partition by id)` functionality without using `RPostgreSQL`? Then we could drop lapply. – zx8754 Oct 20 '15 at 11:14
  • Dude, that's amazing! I didn't even know it could be done in regular SQL. Thanks! I tested it, but it runs slower than my method 1, though faster than my method 2. – quant actuary Oct 21 '15 at 04:46