I have a large number of individuals (30,000), each with a large number of genetic positions (in the format CHR, POS, REF, ALT):
require(data.table)
ind1 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
'POS' = c(10111,343243,89128312,843242),
'REF' = c('A','T','C','G'),
'ALT' = c('G','A','T','C'))
ind2 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
'POS' = c(48392423,343243,49347898239,8432423),
'REF' = c('C','T','A','G'),
'ALT' = c('T','A','G','C'))
I want to create a set of all unique variants present in these individuals. I'm currently attempting this using foreach
:
require(foreach)
ind_list <- list(ind1,ind2)
combineFunction <- function( x, ... ) {
unique(mapply(rbind,x,...,simplify=FALSE))
}
inds = c('ind1','ind2')
out <- foreach( ind = 1:length(inds), .combine = 'combineFunction', .inorder = FALSE, .multicombine = TRUE ) %do% {
data <- ind_list[[ind]]
return(data)
}
However, this doesn't give me the desired output:
CHR POS REF ALT
[1,] "chr1" "10111" "A" "G"
[2,] "chr1" "48392423" "C" "T"
[3,] "FALSE" "0" "FALSE" "FALSE"
[4,] "chr2" "343243" "T" "A"
[5,] "chr3" "89128312" "C" "T"
[6,] "chr3" "49347898239" "A" "G"
[7,] "chr4" "843242" "G" "C"
[8,] "chr4" "8432423" "G" "C"
My desired output is equivalent to unique(rbind(ind1,ind2))
, or:
CHR POS REF ALT
1: chr1 10111 A G
2: chr2 343243 T A
3: chr3 89128312 C T
4: chr4 843242 G C
5: chr1 48392423 C T
6: chr3 49347898239 A G
7: chr4 8432423 G C
This needs to be done while reading in the data due to memory constraints (30,000 individuals x 200,000+ variants per individual, the majority of which are not unique). Thanks in advance for any help or suggestions. This code also needs to be pretty efficient, so any pointers there are also more than welcome. If there's an easier bash solution I'm also happy to try that!