I'm trying to find the values, and the locations of those values, that occur only once per row in a data.table. I found this code to find the values per row:
How to find all values which only appear less than X times in a vector
I use that in the following code, and I was wondering how I can make it go faster. Currently it takes this long on 1000 rows.
One with apply:
system.time(apply(singletons, 1, function(x) Filter(function(elem) length(which(x == elem)) <= 1, x)))
   user  system elapsed
 18.528   0.000  18.543
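To make explicit what the Filter call does to a single row, here is a minimal sketch using the first sample row from the data shown later in this question. It also shows one possible cheaper equivalent based on `duplicated()`; the names `x`, `slow`, `is_single` and `fast` are only for illustration, and I have not timed this on the full data.

```r
# First sample row from the data in this question.
x <- c("./", "T/G", "T/T", "./", "T/T", "T/T", "T/T", "./")

# Approach used above: for every element, rescan the whole row.
slow <- Filter(function(elem) length(which(x == elem)) <= 1, x)

# Possible alternative (an assumption, not benchmarked here): flag every value
# that has a duplicate anywhere in the row by scanning from both ends,
# then keep the values that were never flagged.
is_single <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
fast <- x[is_single]

# Both approaches should select the same values.
stopifnot(identical(slow, fast))

print(fast)              # values that occur once in the row
print(which(is_single))  # their positions within the row
```

The logical vector `is_single` gives the positions for free via `which()`, which the Filter version does not.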
Rprof("asdas")
apply(singletons, 1, function(x) Filter(function(elem) length(which(x == elem)) <= 1, x))
summaryRprof()
$by.self
                       self.time self.pct total.time total.pct
"=="                        0.08    23.53       0.08     23.53
"as.character.default"      0.06    17.65       0.10     29.41
"ls"                        0.06    17.65       0.06     17.65
"which"                     0.04    11.76       0.26     76.47
"as.character"              0.04    11.76       0.14     41.18
"as.vector"                 0.04    11.76       0.04     11.76
"lapply"                    0.02     5.88       0.28     82.35
$by.total
                       total.time total.pct self.time self.pct
"lapply"                     0.28     82.35      0.02     5.88
"[.data.table"               0.28     82.35      0.00     0.00
"["                          0.28     82.35      0.00     0.00
"Filter"                     0.28     82.35      0.00     0.00
"unlist"                     0.28     82.35      0.00     0.00
"which"                      0.26     76.47      0.04    11.76
"FUN"                        0.26     76.47      0.00     0.00
"as.character"               0.14     41.18      0.04    11.76
"as.character.default"       0.10     29.41      0.06    17.65
"=="                         0.08     23.53      0.08    23.53
"ls"                         0.06     17.65      0.06    17.65
".completeToken"             0.06     17.65      0.00     0.00
"apropos"                    0.06     17.65      0.00     0.00
"normalCompletions"          0.06     17.65      0.00     0.00
"as.vector"                  0.04     11.76      0.04    11.76
$sample.interval
[1] 0.02
$sampling.time
[1] 0.34
One within data.table:
system.time(singletons[, Filter(function(elem) length(which(as.character(.SD) == elem)) <= 1, as.character(.SD)), by = ID])
   user  system elapsed
 25.064   0.000  25.085
Rprof("asdas")
singletons[, Filter(function(elem) length(which(as.character(.SD) == elem)) <= 1, as.character(.SD)), by = ID]
summaryRprof()
$by.self
                       self.time self.pct total.time total.pct
"=="                        0.08    23.53       0.08     23.53
"as.character.default"      0.06    17.65       0.10     29.41
"ls"                        0.06    17.65       0.06     17.65
"which"                     0.04    11.76       0.26     76.47
"as.character"              0.04    11.76       0.14     41.18
"as.vector"                 0.04    11.76       0.04     11.76
"lapply"                    0.02     5.88       0.28     82.35
$by.total
                       total.time total.pct self.time self.pct
"lapply"                     0.28     82.35      0.02     5.88
"[.data.table"               0.28     82.35      0.00     0.00
"["                          0.28     82.35      0.00     0.00
"Filter"                     0.28     82.35      0.00     0.00
"unlist"                     0.28     82.35      0.00     0.00
"which"                      0.26     76.47      0.04    11.76
"FUN"                        0.26     76.47      0.00     0.00
"as.character"               0.14     41.18      0.04    11.76
"as.character.default"       0.10     29.41      0.06    17.65
"=="                         0.08     23.53      0.08    23.53
"ls"                         0.06     17.65      0.06    17.65
".completeToken"             0.06     17.65      0.00     0.00
"apropos"                    0.06     17.65      0.00     0.00
"normalCompletions"          0.06     17.65      0.00     0.00
"as.vector"                  0.04     11.76      0.04    11.76
$sample.interval
[1] 0.02
$sampling.time
[1] 0.34
Any help in figuring out how to make it go faster would be much appreciated.
Also, I'm looking to find the positions of the values that occur only once in each row, so if anyone has good ideas about that, let me know.
Edit: some notes about the data. Every row has only one value that occurs once, and it is not always in column two.
I got rid of the first three columns:
V1  V2  V3  V4  V5  V6  V7  V8
./  T/G T/T ./  T/T T/T T/T ./
./  G/T G/G ./  G/G G/G G/G ./
./  C/A C/C C/C C/C C/C C/C ./
./  G/T G/G G/G G/G G/G G/G ./
./  G/C G/G G/G G/G G/G G/G ./
A/A A/T A/A A/A A/A A/A A/A A/A
Desired output:
A character vector containing the values that occur only once per row.
So:
("T/G", "G/T", ...)
Or, if someone figures out the indices part, then a data.frame (the row column is not necessary):
singleton row column
"T/G"       1      2
"G/T"       2      2
.......
.......
.......
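For concreteness, here is a minimal sketch that produces output of that shape from the six sample rows above. It is one possible approach, not a benchmarked solution. It assumes the table holds only the character genotype columns (no ID column), and the names `m`, `once`, `idx` and `result` are made up for the example. The rows are rebuilt as a plain data.frame so the snippet runs on its own; `as.matrix()` can be called on a data.table in the same way.

```r
# Sample rows from the question, rebuilt column by column (genotype columns only).
singletons <- data.frame(
  V1 = c("./",  "./",  "./",  "./",  "./",  "A/A"),
  V2 = c("T/G", "G/T", "C/A", "G/T", "G/C", "A/T"),
  V3 = c("T/T", "G/G", "C/C", "G/G", "G/G", "A/A"),
  V4 = c("./",  "./",  "C/C", "G/G", "G/G", "A/A"),
  V5 = c("T/T", "G/G", "C/C", "G/G", "G/G", "A/A"),
  V6 = c("T/T", "G/G", "C/C", "G/G", "G/G", "A/A"),
  V7 = c("T/T", "G/G", "C/C", "G/G", "G/G", "A/A"),
  V8 = c("./",  "./",  "./",  "./",  "./",  "A/A"),
  stringsAsFactors = FALSE
)

# Work on a character matrix so each row is a plain vector.
m <- as.matrix(singletons)

# TRUE where a value has no duplicate elsewhere in its row.
# apply() returns one column per input row, hence the transpose.
once <- t(apply(m, 1, function(x) !(duplicated(x) | duplicated(x, fromLast = TRUE))))

# Row/column positions of the TRUE cells, sorted by row and then column.
idx <- which(once, arr.ind = TRUE)
idx <- idx[order(idx[, "row"], idx[, "col"]), , drop = FALSE]

# Values and their positions in the layout described above.
result <- data.frame(
  singleton = m[idx],
  row       = unname(idx[, "row"]),
  column    = unname(idx[, "col"]),
  stringsAsFactors = FALSE
)
print(result)
```

`result$singleton` alone is the character vector of once-per-row values; the `row` and `column` fields cover the positions part.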