A more elegant way to iterate over two data.tables' rows

Question

I'm starting using R more and more frequently, coming from C/C++. For this reason, I often find myself thinking à la C++ when working with R's data structures.

Here I have two data.tables that I have to iterate through and update the value of column 1 and column 2 in table A with the value of column 2 in table B, according to column 1 table B w.r.t. columns 1 and 2 in table A.

Sorry for this confusing description. I try to make it better

I have two data tables (the number of rows is different because they could actually be different):

TabA

     Col1     Col2
1:   TP53     CD68
2:   TP53    MPDU1
3:   TP53     PHF2
4:   TP53 KIAA0753
5:   CD68    ZBTB4
6:   CD68     CHD3
7:  MPDU1    ZBTB4
8:  MPDU1     CHD3
9:  MPDU1   SLC2A4
10: MPDU1     YBX2
11: MPDU1    AURKB
12: MPDU1 TMEM132B
13:  PHF2 C9orf129
14:  PHF2    CDH23
15:  PHF2   PTPDC1

and TabB:

       Col3 Col4
1:   ADAM32  0
2:   ADARB2  1
3:    AGBL2  2
4:   ALOX12  3
5:  ANKRD46  4
6:    APOL1  5
7:    APOOL  6
8:     ASPA  7
9:      AUH  8
10:   AURKB  9
11:   AUTS2 10
12:    BAAT 11

So basically, I want to compare Col1 and Col2 from TabA with Col3 in TabB: if they are equal substitute the string with the number in Col4 of TabB.

My approach, definitely C-style:

for(i in 1:nrow(TabA)) {
    for(j in 1:nrow(TabB)) {
        if(TabA$Col1[i] == TabB$Col3[j]) { 
            TabA$Col1[i] <- TabB$Col4[j] 
        }
        if(TabA$Col2[i] == TabB$Col3[j]) { 
            TabA$Col2[i] <- TabB$Col4[j] 
        }
    }
}

This works as expected, but I am pretty sure there is a more elegant (and more efficient) way to do that, exploiting data.table's capabilities. Does anybody have a suggestion?

Thanks

Are you sure those are data.tables (which require loading the data.table package)? The tutorials for that package would provide some guidance if that's what you're using: https://github.com/Rdatatable/data.table/wiki/Getting-started — Frank, Jun 30 '15 at 16:13
Don't loop in R unless you absolutely need it. Vectorized everything. Your code can be vectorized. You make no reference to column 4 in your description yet it is present in your code. — Vlo, Jun 30 '15 at 16:36
@Frank Yes, they are data.tables. Thanks for the link, definitely useful — Faabiioo, Jun 30 '15 at 23:02
@Vio I know loops are bad in R, I started feeling it soon after I begun working with R! Unfortunately, in this case I could not find a better solution than this, that's why I thought I needed some help — Faabiioo, Jun 30 '15 at 23:05

score 5 · Accepted Answer · edited Jun 20 '20 at 09:12

If you are using data.table your original problem (prior to the extensive edits which changed the question) could be done like this:

TabA <- data.table(Col1 = sample(LETTERS, 15), Col3 = rnorm(15))
TabB <- data.table(Col2 = sample(LETTERS, 15), Col4 = rnorm(15))
setkey(TabA, Col1)
setkey(TabB, Col2)
TabA
# shows TabA before changing it
TabA[TabB, Col3 := Col4]

I still find data.table's syntax a little odd because it differs from the standard behaviour in R when functions don't modify their arguments. On the other hand, it makes for concise, memory-efficient and fast code.

Solution to rewritten question

Note your sample data is unhelpful - there are no matches of TabA$Col1 in TabB$Col3. Anyway, this could be tackled in a few ways.

Using data.table := operator:

TabA[Col1 %in% TabB$Col3, Col1 := with(TabB, as.character(Col4[na.omit(match(Col1, Col3))]))]
TabA[Col2 %in% TabB$Col3, Col2 := with(TabB, as.character(Col4[na.omit(match(Col2, Col3))]))]

Using base R syntax (would work for a data.frame also):

TabA$Col1[TabA$Col1 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col1[TabA$Col1 %in% TabB$Col3], TabB$Col3)]
TabA$Col2[TabA$Col2 %in% TabB$Col3] <- TabB$Col4[match(TabA$Col2[TabA$Col2 %in% TabB$Col3], TabB$Col3)]

Using setkey and joins:

TabA[, Index := 1:nrow(TabA)]
setkey(TabA, Col1)
TabA[TabB, nomatch = 0, Col1 := as.character(Col4)]
setkey(TabA, Col2)
TabA[TabB, nomatch = 0, Col2 := as.character(Col4)]
setkey(TabA, Index)
TabA[, Index := NULL]

All of these assume that some items in Col1 and Col2 won't be matched. The code could be optimised if this isn't true. This is also the reason that Col4 has to be coerced to character.

We'll have both functionalities (updating by reference and default R behaviour).. as I've [detailed here](http://stackoverflow.com/a/27718317/559784)... We haven't gotten to it yet. — Arun, Jun 30 '15 at 22:35
I upvoted and much appreciated Minh Ha's solution, but accept this one because I think it is a more elegant approach. However, the last solution still remains unclear to me, as I don't get the need of the `nomatch = 0`. Thanks for the help. — Faabiioo, Jul 02 '15 at 20:45
No worries @Faabiioo. The `nomatch=0` is to perform an inner join. Without that, your joined table will have rows where there are no values in `TabA` matching the relevant column in `TabB` which is unnecessary and will also throw an error if the number of rows exceeds the number of rows of the max of `TabA` and `TabB`. — Nick Kennedy, Jul 02 '15 at 20:50

score 1 · Answer 2 · answered Jul 01 '15 at 06:31

I usually use by=1:nrow(DT) when I want to iterate throw each row of datatabe

library(data.table)

TabA <- data.table(Col1=c('A', 'B', 'C', 'D'), Col2=c('B', 'A', 'D', 'C'))
TabB <- data.table(Col3=c('A', 'B', 'C'), Col4=c(1, 2, 3))

TabA[, .(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)), 
         Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
         ), 
     by=1:nrow(TabA)][, .(Col1, Col2)]

or

TabA[, `:=`(Col1=as.character(ifelse(nrow(TabB[Col3==Col1]) > 0, TabB[Col3==Col1]$Col4, Col1)), 
            Col2=as.character(ifelse(nrow(TabB[Col3==Col2]) > 0, TabB[Col3==Col2]$Col4, Col2))
            ), 
     by=1:nrow(TabA)]

score 0 · Answer 3 · answered Jun 30 '15 at 16:27

0

Since I don't have your data I made a little example but I think this what you are looking for:

x<- c("a","b","c","d") 
y<-c("a","d","e","f")
z<-c("z1","z2","z3","z4")
x[x %in% y]<-z[x %in% y]

answered Jun 30 '15 at 16:27

Andrelrms

819
9
13

RHertel · Answer 4 · 2015-07-01T06:27:49.100

0

Here's one suggestion that uses only base R.

the_equals <- TabA$Col1[which(TabA$Col1 %in% TabB$Col3)]
nequals <- length(the_equals)
if(nequals>0) {
 idx1 <- lapply(1:nequals, function(x) which(TabA$Col1==the_equals[x]))
 idx2 <- lapply(1:nequals, function(x) which(TabB$Col3==the_equals[x]))
 Col2_new <- sapply(1:nequals, function(x) TabB$Col4[idx2[[x]][1]])
 for(i in 1:nequals) TabA$Col2[idx1[[i]]] <- Col2_new[i]
}

Hope this helps.

edited Jul 01 '15 at 06:27

answered Jul 01 '15 at 05:04

RHertel

23,412
5
38
64

Definitely a good solution, thanks. But what I meant was something more "_R-ish_". Minh Ha solution is likely what I was looking for – Faabiioo Jul 01 '15 at 13:25

A more elegant way to iterate over two data.tables' rows

4 Answers4

Solution to rewritten question