Find values that only occur once per row over large data.table or data.frame

Question

I'm trying to find the values and locations of the values that only occur once per row in a data.table. I found this code to fine the values per row:

How to find all values which only appear less than X times in a vector

I use that in the following code. I was wondering how I can make it go faster. Currently it takes this long over 1000 rows

one with apply:

system.time(apply((singletons),1, function(x) Filter(function (elem) length(which((x) == elem)) <= 1, (x))))
user  system elapsed 
18.528   0.000  18.543 

Rprof("asdas")
(apply((singletons),1, function(x) Filter(function (elem) length(which((x) == elem)) <= 1, (x))))
summaryRprof()


    $by.self
                           self.time self.pct total.time total.pct
    "=="                        0.08    23.53       0.08     23.53
    "as.character.default"      0.06    17.65       0.10     29.41
    "ls"                        0.06    17.65       0.06     17.65
    "which"                     0.04    11.76       0.26     76.47
    "as.character"              0.04    11.76       0.14     41.18
    "as.vector"                 0.04    11.76       0.04     11.76
    "lapply"                    0.02     5.88       0.28     82.35

    $by.total
                           total.time total.pct self.time self.pct
    "lapply"                     0.28     82.35      0.02     5.88
    "[.data.table"               0.28     82.35      0.00     0.00
    "["                          0.28     82.35      0.00     0.00
    "Filter"                     0.28     82.35      0.00     0.00
    "unlist"                     0.28     82.35      0.00     0.00
    "which"                      0.26     76.47      0.04    11.76
    "FUN"                        0.26     76.47      0.00     0.00
    "as.character"               0.14     41.18      0.04    11.76
    "as.character.default"       0.10     29.41      0.06    17.65
    "=="                         0.08     23.53      0.08    23.53
    "ls"                         0.06     17.65      0.06    17.65
    ".completeToken"             0.06     17.65      0.00     0.00
    "apropos"                    0.06     17.65      0.00     0.00
    "normalCompletions"          0.06     17.65      0.00     0.00
    "as.vector"                  0.04     11.76      0.04    11.76

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 0.34

one within data.table

system.time(singletons[, Filter(function (elem) length(which(as.character(.SD) == elem)) <= 1, as.character(.SD)) , by = ID  ])
user  system elapsed 
25.064   0.000  25.085 

Rprof("asdas")
singletons[, Filter(function (elem) length(which(as.character(.SD) == elem)) <= 1, as.character(.SD)) , by = ID  ]
summaryRprof()

$by.self
                           self.time self.pct total.time total.pct
    "=="                        0.08    23.53       0.08     23.53
    "as.character.default"      0.06    17.65       0.10     29.41
    "ls"                        0.06    17.65       0.06     17.65
    "which"                     0.04    11.76       0.26     76.47
    "as.character"              0.04    11.76       0.14     41.18
    "as.vector"                 0.04    11.76       0.04     11.76
    "lapply"                    0.02     5.88       0.28     82.35

    $by.total
                           total.time total.pct self.time self.pct
    "lapply"                     0.28     82.35      0.02     5.88
    "[.data.table"               0.28     82.35      0.00     0.00
    "["                          0.28     82.35      0.00     0.00
    "Filter"                     0.28     82.35      0.00     0.00
    "unlist"                     0.28     82.35      0.00     0.00
    "which"                      0.26     76.47      0.04    11.76
    "FUN"                        0.26     76.47      0.00     0.00
    "as.character"               0.14     41.18      0.04    11.76
    "as.character.default"       0.10     29.41      0.06    17.65
    "=="                         0.08     23.53      0.08    23.53
    "ls"                         0.06     17.65      0.06    17.65
    ".completeToken"             0.06     17.65      0.00     0.00
    "apropos"                    0.06     17.65      0.00     0.00
    "normalCompletions"          0.06     17.65      0.00     0.00
    "as.vector"                  0.04     11.76      0.04    11.76

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 0.34

Any help in figuring out how to make it go faster would be much appreciated.

Also I'm looking to find the positions of those thing that only occur once in the row, so if anyone has good ideas about that let me know.

edit: data notes about the data, every line only has one value that occurs once its not always in columns two

I got rid of the first three columns:

V1 V2 V3 V4 V5 V6 V7 V8
./    T/G    T/T     ./    T/T    T/T    T/T     ./
./    G/T    G/G     ./    G/G    G/G    G/G     ./
./    C/A    C/C    C/C    C/C    C/C    C/C     ./
./    G/T    G/G    G/G    G/G    G/G    G/G     ./
./    G/C    G/G    G/G    G/G    G/G    G/G     ./
A/A    A/T    A/A    A/A    A/A    A/A    A/A    A/A

desired output:

character vector containing the values that only occur once per row.

So:

("T/G", "G/T", ...)

or if someone figures out the indices part than a data.frame (the row column not necessary):

singleton row column
"T/G" 1 2
"G/T" 2 2
.......
.......
.......

See `?table`. You might find something like `which(table(row(singletons), unlist(singletons)) == 1L, arr.ind = TRUE)` useful. — alexis_laz, Mar 03 '16 at 19:25
@alexis_laz very cool but doesn't get me the indices in the original matrix seems like its the right direction though. I'll keep chugging and see if I find something — njBernstein, Mar 03 '16 at 19:37
@njBernstein : You're right; it returns only the values - extra steps are needed to match to the respective positions. See, also, [this post](http://stackoverflow.com/questions/12495345/find-indices-of-duplicated-rows) to locate unique values - e.g.: `apply(singletons, 1, function(x) which(!(duplicated(x) | duplicated(x, fromLast = TRUE))))` should be more efficient than your approach even if it includes a simple by-row looping because it avoids the "ncol * ncol" looping in each row. — alexis_laz, Mar 03 '16 at 19:56
That gave you the desired output? It just gives me some matrix. — David Arenburg, Mar 03 '16 at 20:13
Not exactly. Gives me the indices in a list very quickly. Which allows me to just use a lookup using the indices. Not exactly what I wanted but I can make it work because of how fast it runs. @DavidArenburg I appreciate your help in formulating my thoughts. Thanks — njBernstein, Mar 03 '16 at 20:21
Either way, if you want to keep it in a `data.table` world (`apply` converts to a matrix) and avoid by row operations, you could transform to a long format and work on a single column. Try this for example `melt(singletons[, Row := .I], "Row")[, if(.N == 1) .SD, by = .(Row, value)]` — David Arenburg, Mar 03 '16 at 20:51

David Arenburg · Accepted Answer · 2016-03-03T22:35:48.220

I would suggest instead of operating by row and converting your data set to a matrix using apply, just transfer it to a long format an operate on a single column

melt(singletons[, Row := .I], "Row")[, 
                                     if(.N == 1L) .(Column = variable), 
                                     by = .(Row, value)]
#    Row value Column
# 1:   1   T/G     V2
# 2:   2   G/T     V2
# 3:   3   C/A     V2
# 4:   4   G/T     V2
# 5:   5   G/C     V2
# 6:   6   A/T     V2

Some benchmarks- so except it being the only function that gives a readable output, this is by far the fastest

set.seed(123)
N <- 1e4
BIDsingletons <- as.data.table(matrix(sample(unlist(singletons), N, replace = TRUE), ncol = N/1e2))
BIDsingletons2 <- copy(BIDsingletons)
DT <- function(BIDsingletons2) melt(BIDsingletons2[, Row := .I], "Row")[, 
                                     if(.N == 1L) .(Column = variable), 
                                     by = .(Row, value)]
OP <- function(BIDsingletons) apply(BIDsingletons, 1, function(x) Filter(function (elem) length(which((x) == elem)) <= 1, (x)))
Alexis_Laz <- function(BIDsingletons) apply(BIDsingletons, 1, function(x) which(!(duplicated(x) | duplicated(x, fromLast = TRUE))))

library(microbenchmark)
microbenchmark(DT(BIDsingletons2),
               #OP(BIDsingletons),
               Alexis_Laz(BIDsingletons))

# Unit: milliseconds
#                      expr       min        lq      mean    median        uq        max neval cld
#        DT(BIDsingletons2)  1.660324  1.911583  2.373655  2.093168  2.407389   8.150031   100  a 
#         OP(BIDsingletons) 57.763136 65.187614 72.071544 69.557509 76.446112 150.318052   100   b
# Alexis_Laz(BIDsingletons)  2.617990  2.847735  3.489971  3.052611  3.529667   8.605180   100  a

Comparing for a bit bigger data set

N <- 1e6
BIDsingletons <- as.data.table(matrix(sample(unlist(singletons), N, replace = TRUE), ncol = N/1e2))
BIDsingletons2 <- copy(BIDsingletons)
microbenchmark(DT(BIDsingletons2),
               # OP(BIDsingletons),
               Alexis_Laz(BIDsingletons))

# Unit: milliseconds
#                      expr       min        lq      mean   median        uq      max neval cld
#        DT(BIDsingletons2)  30.26517  33.79918  44.56996  36.2648  42.76773 128.8803   100  a 
# Alexis_Laz(BIDsingletons) 148.89655 213.85403 231.91895 232.7776 249.27168 325.6523   100   b

Very interesting. The use melt and then going by row, value is real nice. Thanks so much. I hadn't seen the .I operator before. Very nice — njBernstein, Mar 04 '16 at 17:04

Find values that only occur once per row over large data.table or data.frame

1 Answers1