Given a data.table in R, I want to find rows that are the reversed version of a previous row. For example:

> head(DT)
     V1    V2
1 nameA nameB
2 nameA nameC
3 nameB nameA
4 nameB nameF
5 nameN nameP
6 nameP nameN

In the case of row 1, the code should return row 3. In the case of row 5, the code should return row 6. Eventually, I want to drop the "reversed" rows.

The real dataset has 0.5 million rows and 2 columns. At the moment I am using this piece of code, which does the job:

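library(foreach)
library(doMC)
registerDoMC(4)
# For each row i, collect the later rows that are its reverse (V1 and V2 swapped).
# Keeping only idx > i avoids also returning the earlier row of each pair;
# a shared rm.idx cannot be checked here because the workers never see its updates.
rm.idx <- foreach(i = seq_len(nrow(DT)), .combine = 'c') %dopar% {
  idx <- which(DT$V1[i] == DT$V2 & DT$V2[i] == DT$V1)
  idx[idx > i]
}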

The code "returns" a vector (rm.idx) containing the indices of the rows that are the reversed version of a previous row.

However, it takes a long time (more than 30 min) for a relatively "small" data set. I often find that R has some tweak or function that does the trick much faster (or that my code is simply not very efficient). Therefore, I am wondering if anyone knows a faster way of finding rows that are the reverse of a previous row.

Thanks in advance for your time.

Javier
  • Does it need to be only the immediately preceding row, or any row that came before it? – Andrew Taylor Jan 23 '15 at 14:42
  • @AndrewTaylor: Hi Andrew. No it does not. If you see the example data table I showed in the question, the first index returned by the code would have to be `3`, because `row 3` is the same as `reversed(row 1)`, and of course, `row 1` is not immediately preceding `row 3` . Overall, "any row that came before it" is fine for me. Thanks – Javier Jan 23 '15 at 14:53
  • How do the answers over [here](http://stackoverflow.com/questions/22756392/deleting-reversed-duplicates-with-r) compare to what you have, in terms of speed? – Andrew Taylor Jan 23 '15 at 15:01
  • Using foreach() here will massively SLOW your processing. – LauriK Jan 23 '15 at 15:02
  • Do you have only two columns, or might you have more? – Marat Talipov Jan 23 '15 at 15:07
  • @LauriK. I did try without it and 10000 rows and it was faster...but I will try again. – Javier Jan 23 '15 at 15:07
  • @MaratTalipov: only two – Javier Jan 23 '15 at 15:07
  • Do you want to drop those rows in the end? – talat Jan 23 '15 at 15:08
  • @docendodiscimus: yes, sorry, I should have specified this in the question. I have just edited it including a relevant sentence, but overall, yes, I do want to get rid of it. – Javier Jan 23 '15 at 15:12
  • @AndrewTaylor. Thanks, the answer here works for me in no time. http://stackoverflow.com/questions/22756392/deleting-reversed-duplicates-with-r – Javier Jan 23 '15 at 15:14
  • Try `DT[!duplicated(paste(pmin(V1, V2),pmax(V1,V2)))]` – akrun Jan 23 '15 at 15:19
  • @akrun: thanks, this is very similar to the answer that Andrew pointed before and also works. Thanks again. – Javier Jan 23 '15 at 15:20
  • @Javier But, I think this should be faster because of `pmax`, `pmin` – akrun Jan 23 '15 at 15:21
  • @akrun: just tried, it is! Thanks! (btw, note that you are missing a comma before the last `]`) :) Many thanks – Javier Jan 23 '15 at 15:25
  • @Javier Actually, my solution was in `data.table`. So, it should work. `setDT(DT)[..` – akrun Jan 23 '15 at 15:26
  • @akrun: ah ok! But `setDT(DT)`...? What do you mean? Do you mean `setnames`? Or do you mean literally typing `setDT(DT)[!duplicated(paste(pmin(V1, V2),pmax(V1,V2)))] ` – Javier Jan 23 '15 at 15:36
  • @Javier I meant if your `DT` object was `data.frame`, it should be converted to `data.table`. By typing that, I get the subset – akrun Jan 23 '15 at 15:37
  • @akrun. Yeah sorry, I am silly, just re-started R and did not load data.table. ¬¬ . Thanks again! – Javier Jan 23 '15 at 15:40
  • @Javier If you can show the benchmarks in your post, it would be great. – akrun Jan 23 '15 at 15:40
  • @akrun: you mean a comparison of the system.time measurements?? – Javier Jan 23 '15 at 15:56
  • @Javier `microbenchmark` would be more informative on a bigger dataset. But system.time gives some info. – akrun Jan 23 '15 at 15:57
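
The `pmin`/`pmax` suggestion from the comments works by building an order-independent key for each row: a row and its reverse get the same key, so `duplicated()` flags whichever of the two comes later. Below is a minimal sketch on the example data, assuming character columns named `V1` and `V2`. The name `pair_key` and the explicit `sep = "|"` are additions for illustration; the separator is assumed never to occur in the names.

```r
library(data.table)

DT <- data.table(V1 = c("nameA", "nameA", "nameB", "nameB", "nameN", "nameP"),
                 V2 = c("nameB", "nameC", "nameA", "nameF", "nameP", "nameN"))

# Order-independent key: the alphabetically smaller name always comes first,
# so (nameA, nameB) and (nameB, nameA) map to the same string.
# sep = "|" is an assumption: any character that never appears in the names works.
pair_key <- DT[, paste(pmin(V1, V2), pmax(V1, V2), sep = "|")]

rm.idx   <- which(duplicated(pair_key))  # rows whose pair was already seen
DT_clean <- DT[!duplicated(pair_key)]    # keep the first row of each pair

print(rm.idx)
print(DT_clean)
```

Note that this also drops exact repeats of an earlier row (same order), not only reversed ones, which may or may not be what is wanted.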

1 Answer

To find these, you can use some data.table functions, like this:

> dt <- data.table(V1 = c("A", "A", "B", "B", "N","P"), V2 = c("B","C","A","F","P","N"))
> dt
   V1 V2
1:  A  B
2:  A  C
3:  B  A
4:  B  F
5:  N  P
6:  P  N
> dt1 <- dt[, paste0(V1, V2)]
> dt1
[1] "AB" "AC" "BA" "BF" "NP" "PN"
> dt2 <- dt[, paste0(V2, V1)]
> dt2
[1] "BA" "CA" "AB" "FB" "PN" "NP"
> matches <- data.table(m = match(dt1, dt2))
> matches
    m
1:  3
2: NA
3:  1
4: NA
5:  6
6:  5
> which(matches[, .I > m])
[1] 3 6

I'm using the match() function, which is really fast. First I turn each row into a single string in both orders (V1 then V2, and V2 then V1). Then, for each element of the first character vector, match() gives the position where it first occurs in the second one, i.e. the row that is its reverse (NA if there is none). I put the result into a data.table again so that I can use .I (the row number) there: the rows where .I > m are the ones whose reverse appeared earlier. I made a data.table with 600 000 rows and all of it worked in less than a second.
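
To go from these indices to a cleaned table, the same idea can be condensed as below. This is only a sketch under a few assumptions: it reuses the small example `dt`, the names `fwd`, `bwd` and `dt_clean` are made up for illustration, and a separator `"|"` (not in the code above, and assumed absent from the data) is passed to `paste` so that pairs such as ("ab", "c") and ("a", "bc") cannot produce the same string.

```r
library(data.table)

dt <- data.table(V1 = c("A", "A", "B", "B", "N", "P"),
                 V2 = c("B", "C", "A", "F", "P", "N"))

fwd <- dt[, paste(V1, V2, sep = "|")]  # each row as "V1|V2"
bwd <- dt[, paste(V2, V1, sep = "|")]  # each row reversed, "V2|V1"

# m[i] is the first row whose reversed form equals row i, or NA if none exists.
m <- match(fwd, bwd)

# Row i is a "reversed" row when its partner appears earlier in the table.
rm.idx <- which(!is.na(m) & seq_along(m) > m)

# Guard against the empty case before using negative indexing.
dt_clean <- if (length(rm.idx) > 0) dt[-rm.idx] else dt

print(rm.idx)
print(dt_clean)
```

Since match() only reports the first match, a table that contains the same pair many times may need a `duplicated()`-based key instead.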

LauriK
  • Hi Laurik. Thanks. Someone in the comments pointed towards another answer that also solved my question. I attach it here in case it is of any use to you. It is very similar to your approach: http://stackoverflow.com/questions/22756392/deleting-reversed-duplicates-with-r . Thanks again. – Javier Jan 23 '15 at 15:17
  • Yeah. Depends on the size of the data. Solutions based on data.frame will work fairly fast with everything up to a few hundred thousand rows. Once you get to millions of rows, you really need data.table (and/or plyr/dplyr). It's been quite a learning curve for me, but I've found that learning the data.table package's tricks is very beneficial. – LauriK Jan 23 '15 at 15:24