
I have a large number of individuals (30,000), each with a large number of genetic positions (in the format CHR, POS, REF, ALT):

require(data.table)
ind1 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
                   'POS' = c(10111,343243,89128312,843242),
                   'REF' = c('A','T','C','G'),
                   'ALT' = c('G','A','T','C'))

ind2 <- data.table('CHR' = c('chr1','chr2','chr3','chr4'),
                   'POS' = c(48392423,343243,49347898239,8432423),
                   'REF' = c('C','T','A','G'),
                   'ALT' = c('T','A','G','C'))

I want to create a set of all unique variants present in these individuals. I'm currently attempting this using foreach:

require(foreach)
ind_list <- list(ind1,ind2)


combineFunction <- function( x, ... ) {
  
  unique(mapply(rbind,x,...,simplify=FALSE))
  
}

inds = c('ind1','ind2')

out <- foreach( ind = 1:length(inds), .combine = 'combineFunction', .inorder = FALSE, .multicombine = TRUE ) %do% {
  
  data <- ind_list[[ind]]
  return(data)
  
}

However, this doesn't give me the desired output:

    CHR     POS           REF     ALT    
[1,] "chr1"  "10111"       "A"     "G"    
[2,] "chr1"  "48392423"    "C"     "T"    
[3,] "FALSE" "0"           "FALSE" "FALSE"
[4,] "chr2"  "343243"      "T"     "A"    
[5,] "chr3"  "89128312"    "C"     "T"    
[6,] "chr3"  "49347898239" "A"     "G"    
[7,] "chr4"  "843242"      "G"     "C"    
[8,] "chr4"  "8432423"     "G"     "C"

My desired output is equivalent to unique(rbind(ind1,ind2)), or:

   CHR         POS REF ALT
1: chr1       10111   A   G
2: chr2      343243   T   A
3: chr3    89128312   C   T
4: chr4      843242   G   C
5: chr1    48392423   C   T
6: chr3 49347898239   A   G
7: chr4     8432423   G   C

This needs to be done while reading in the data due to memory constraints (30,000 individuals x 200,000+ variants per individual, the majority of which are not unique). Thanks in advance for any help or suggestions. This code also needs to be pretty efficient, so any pointers there are also more than welcome. If there's an easier bash solution I'm also happy to try that!

  • FYI, if the files are CSV or similar files, you might go much faster doing the unique-part in a shell script using `sort` and `uniq` with specific fields. It's not as straightforward in some ways, but it can be much faster. *shrug* – r2evans Aug 03 '22 at 13:11
  • This is exactly what I ended up doing ;) I did some speed tests on a small subset of my cohort and depending on file size I found a system() call to bash to be between 3-5x faster than even a parallelised foreach() on a modest number of CPUs. – undercover_camel Aug 03 '22 at 13:25
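A minimal sketch of that shell-based approach, assuming each individual's variants sit in a headerless, tab-separated file with the columns CHR, POS, REF, ALT (the directory and file names here are made up for illustration):

library(data.table)

# Since all four columns together define a variant, whole-line deduplication
# is enough: concatenate every per-individual file, sort, and keep unique lines.
system("cat variants/ind_*.tsv | sort -u > unique_variants.tsv")

# Read the deduplicated set back in as a data.table.
unique_variants <- fread("unique_variants.tsv",
                         col.names = c("CHR", "POS", "REF", "ALT"))

Alternatively, fread() can run the pipeline itself via its cmd argument, e.g. fread(cmd = "cat variants/ind_*.tsv | sort -u", ...), which skips the intermediate file.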

1 Answer


Use do.call in place of mapply. mapply walks over the columns of each data.table in parallel and row-binds those column vectors, which is why the result collapses into a character matrix; do.call(rbind, list(...)) stacks the tables themselves row-wise and keeps the result a data.table:

combineFunction <- function(...) {
  unique(do.call(rbind, list(...)))
}

out <- foreach( ind = 1:length(ind_list), .combine = 'combineFunction', .inorder = FALSE, .multicombine = TRUE ) %do% {
  ind_list[[ind]]
}
out
#       CHR         POS    REF    ALT
#    <char>       <num> <char> <char>
# 1:   chr1       10111      A      G
# 2:   chr2      343243      T      A
# 3:   chr3    89128312      C      T
# 4:   chr4      843242      G      C
# 5:   chr1    48392423      C      T
# 6:   chr3 49347898239      A      G
# 7:   chr4     8432423      G      C

I don't know that foreach is really doing anything for you there, since there is no computation being done on separate nodes. You can get the same effect with

out <- Reduce(combineFunction, ind_list)

though I have not done memory benchmarking to prove my belief.

BTW, you should almost always use library, not require. The latter does not stop the code that follows from running when the package is unavailable, which is almost never what you want. If you do use require, you need to check its return value and, if it is FALSE, fail in an informative way ... which is exactly what library does for you. Refs: https://stackoverflow.com/a/51263513.
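For instance, a guarded require() along those lines might look like this (just a sketch of the pattern):

# require() returns FALSE (with a warning) if the package cannot be attached,
# so its return value has to be checked explicitly.
if (!require(data.table)) {
  stop("Package 'data.table' is needed but could not be loaded.")
}

# library() throws an informative error on its own, so this is usually simpler:
library(data.table)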

r2evans
  • In my actual use case I'm reading data on each iteration of foreach() and using %dopar% with a parallel backend :). This was very helpful nonetheless, so thank you! – undercover_camel Aug 03 '22 at 13:05
  • That makes sense ... it seemed a bit contrived to use `foreach` to do a list-lookup, so I guessed there might be something more to it. (Then again, I have seen over-eager code using some overly-complicated mechanism just to do a `Reduce` thing.) – r2evans Aug 03 '22 at 13:09
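For reference, a rough sketch of that read-as-you-go pattern; the doParallel backend, the one-file-per-individual layout, and the file names are assumptions rather than part of the question:

library(data.table)
library(foreach)
library(doParallel)

# Hypothetical layout: one headerless, tab-separated variant file per individual.
variant_files <- list.files("variants", pattern = "\\.tsv$", full.names = TRUE)

combineFunction <- function(...) {
  unique(do.call(rbind, list(...)))
}

cl <- makeCluster(4)
registerDoParallel(cl)

out <- foreach(f = variant_files,
               .combine = combineFunction,
               .multicombine = TRUE,
               .inorder = FALSE,
               .packages = "data.table") %dopar% {
  # Each worker reads one individual's variants; the combine step
  # deduplicates results as they come back rather than holding every
  # raw table in memory at once.
  fread(f, col.names = c("CHR", "POS", "REF", "ALT"))
}

stopCluster(cl)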