I am dealing with some large data in R:

I have a vector of normally distributed random numbers of length about 6400*50000, and I need to sum every 4 consecutive elements of this vector to get a smaller one.

Is there any efficient way to do this in R?

What I have tried so far:

  1. Reshaping the vector into a matrix with ncol=10 and using the apply function; this failed because the matrix is too big.
  2. Trying the parallel and foreach packages, but with no progress yet.

example code:

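library(parallel)
library(RcppZiggurat)
library(doParallel)
library(foreach)

coreNums <- detectCores()
N1 <- 6400
K <- 50000  # not defined in the original snippet; taken from the length 6400*50000 stated above
M <- 4
N2 <- N1 / M
cl <- makeCluster(getOption("cl.cores", coreNums))
registerDoParallel(cl)
vector1 <- zrnorm(N1 * K)
# Parentheses are needed around both bounds: ":" binds tighter than "*" and "+"
vector2 <- foreach(i = 1:(N2 * K)) %dopar% { sum(vector1[(M * (i - 1) + 1):(M * i)]) }
vector2 <- unlist(vector2)  # flatten the list returned by foreach
stopCluster(cl)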
Lei_Xu
  • Please make this a [reproducible question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by producing a *small* sample vector, your expected output, and what code you've tried so far. – r2evans Apr 26 '17 at 04:50
  • Updated with a block of code. – Lei_Xu Apr 26 '17 at 05:00

1 Answer

I think colSums is the function you are looking for.

vector1 = rnorm(1000*50000)
dim(vector1) = c(10, length(vector1)/10)
vector2 = colSums(vector1)

In my opinion, the task is too simple to benefit from parallelization. Also, I did not run into any problems with the matrix size.
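
As a small check of the reshape-and-`colSums` idea, here is a sketch that wraps it in a helper and uses groups of 4, as in the question. The name `sum_groups` and its argument `m` are illustrative and not part of the original code, and the sketch assumes the vector length is a multiple of `m`.

```r
# Hypothetical helper: sum every m consecutive elements of x.
sum_groups <- function(x, m) {
  stopifnot(length(x) %% m == 0)
  # R fills matrices column by column, so each column holds one group of m values
  colSums(matrix(x, nrow = m))
}

x <- 1:12
sum_groups(x, 4)  # 10 26 42
```

Because the matrix is filled column-wise, putting the group size in the row dimension is what makes `colSums` return one sum per group.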

If you want to use less memory, here is code that does the same thing in chunks of 10,000 values of vector1.

vector2 = double(length(vector1)/10)
# One iteration per chunk of 10,000 values of vector1 (each chunk yields 1,000 sums)
for( i in seq_len(length(vector1)/10000) ){
    part = vector1[((i-1)*10000+1):(i*10000)]
    dim(part) = c(10, 1000)
    vector2[((i-1)*1000+1):(i*1000)] = colSums(part)
}
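
A possible variation, if even `vector1` itself does not fit in memory: since the values are random numbers, they could be generated one chunk at a time and discarded after summing, so the full input vector never exists. This is only a sketch under that assumption. The function name `chunked_group_sums` and its parameters are hypothetical, and `rnorm` stands in for whichever generator is actually used.

```r
# Hypothetical helper: returns n_groups sums, each over m freshly generated values.
# Values are produced chunk by chunk, so only one chunk is held in memory at a time.
chunked_group_sums <- function(n_groups, m, groups_per_chunk = 1000, gen = rnorm) {
  out <- double(n_groups)
  start <- 1
  while (start <= n_groups) {
    end <- min(start + groups_per_chunk - 1, n_groups)
    k <- end - start + 1                   # groups in this chunk (the last may be smaller)
    part <- matrix(gen(k * m), nrow = m)   # one column per group of m values
    out[start:end] <- colSums(part)
    start <- end + 1
  }
  out
}

set.seed(1)
v <- chunked_group_sums(n_groups = 2500, m = 4)
length(v)  # 2500
```

The result vector still has to fit in memory, but it is `m` times smaller than the input would have been.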
Andrey Shabalin
  • This works for the case I provided in the question, but it does not work for larger data: my computer reports "cannot allocate vector of size 7.6 Gb". How should I deal with this problem? – Lei_Xu Apr 26 '17 at 17:47
  • First, my code works on my computer even with a `6400 x 50000` matrix. I just have a computer with more memory. – Andrey Shabalin Apr 26 '17 at 17:55
  • Second, if you want to use less memory, you can do this task by parts of, say, 10000 values in `vector1`. I'll update the solution. – Andrey Shabalin Apr 26 '17 at 18:01