
My Goal

So I'm trying to call the normalize_quantiles function from the preprocessCore R package (R 3.2.1) from within a Python 3 script using the rpy2 package, on an enormous matrix (10 GB+ files). I have virtually unlimited memory. When I run it through R itself with the following, I am able to complete the normalization and write the output:

require(preprocessCore);
all <- data.matrix(read.table("data_table.txt",sep="\t",header=TRUE));
all[,6:57]=normalize.quantiles(all[,6:57]);
write.table(all,"QN_data_table.txt",sep="\t",row.names=FALSE);

I'm trying to build this into a Python script that also does other things, using the rpy2 package, but I'm having trouble with the way it builds matrices. An example is below:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr

preprocessCore = importr('preprocessCore')

matrix = sample_list  # My 2-D Python array containing the data.
# Flatten column by column so R fills the matrix in column-major order.
v = robjects.FloatVector([element for col in matrix for element in col])
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(Rnormalized_matrix)

return header, pos_list, norm_matrix

The Issue

This works fine for smaller files, but when I run it on my huge files, it dies with the error:

rpy2.rinterface.RRuntimeError: Error: cannot allocate vector of size 9.7 Gb

I know the max size of a vector for R is 8 Gb, which explains why the above error is being thrown. The rpy2 docs say:

"A Matrix is a special case of Array. As with arrays, one must remember that this is just a vector with dimension attributes (number of rows, number of columns)."
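For example, a small sketch showing the vector underneath an rpy2 matrix (my own illustration of the quote; note that linear indexing is 0-based on the Python side):

import rpy2.robjects as robjects

m = robjects.r['matrix'](robjects.IntVector(range(1, 7)), nrow=2)
print(len(m))        # 6: the length of the underlying vector
print(tuple(m.dim))  # (2, 3): the 'dim' attribute
print(m[2])          # 3rd element of the column-major vector, i.e. m[1, 2] in R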

I sort of wondered how strictly it adhered to this, so I changed my code to initialize a matrix of the size I wanted and then iterate through it, assigning the values to the elements:

matrix = sample_list  # My 2-D Python array of data (same imports as above).
m_count = 1  # R matrices are 1-based, so column indices start at 1.
m = robjects.r['matrix'](0.0, ncol=len(matrix), nrow=len(matrix[0]))
for samp in matrix:
    i_count = 1
    for entry in samp:
        m.rx[i_count, m_count] = entry  # Assign the data to the element.
        i_count += 1
    m_count += 1

print("Performing quantile normalization.")

Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(Rnormalized_matrix)

return header, pos_list, norm_matrix

Again, this works for smaller files, but crashes with the same error as before.

So my question is: what's the underlying difference that allows for the assignment of huge matrices in R but causes issues in rpy2? Is there a different way I need to approach this? Should I just suck it up and do it in R? Or is there a way to circumvent the issues I'm having?

Jared Andrews

1 Answer


At its root, R is a functional language. This means that when doing in R

m[i, j] <- 123

what is happening is something like

assign_to_cell <- `[<-`
m <- assign_to_cell(m, i, j, 123)

where the arguments are passed by value.

This means that a new matrix m should be returned with the cell at (i, j) containing the new value. Making a copy of m with each assignment would be rather inefficient, particularly with larger objects, as you are experiencing. So the R interpreter has a nifty trick (see R's C source for details): the left side of the expression is compared to the right side, and if the objects are the same, the interpreter knows that the copy is unnecessary and the modification can be done "in place".

Now with rpy2 there are two additional points to consider: while Python (mostly) passes arguments by reference, the embedded R engine has no way to know what is happening on the left side of a Python expression.

The expression

m.rx[i_count, m_count] = entry 

is faithfully building an R call like

m <- assign_to_cell(m, i, j, entry)

but with R's ability to look ahead on the left side of the expression lost. The result is that a copy of m is made with each modification.
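To make that concrete, the replacement function can be fetched and called explicitly through rpy2. A minimal sketch (the name assign_to_cell just mirrors the pseudocode above):

import rpy2.robjects as robjects

assign_to_cell = robjects.r['[<-']  # R's replacement function `[<-`
m = robjects.r['matrix'](0.0, nrow=2, ncol=2)
m = assign_to_cell(m, 1, 1, 123.0)  # 1-based R indices; returns the updated matrix

Each such call goes through R's function-call machinery with no chance to reuse m's storage, which is why the element-by-element loop copies the matrix every time.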

However, rpy2 makes it easy to move vectors, matrices, and arrays defined in R to Python's pass-by-reference world. For example, these R objects can be aliased to corresponding numpy objects (using asarray - see http://rpy.sourceforge.net/rpy2/doc-2.0/html/numpy.html#low-level-interface). Remembering that R arrays are stored in column-major order, one can also compute the linear index oneself, skip the aliasing to a numpy array, and modify cells in place with:

m[idx] = entry 
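Expanded into the loop from the question, that looks something like this sketch (variable names are taken from the question; the 0-based, column-major offset arithmetic is the only addition):

nrow = len(matrix[0])  # number of rows in the R matrix

for j, samp in enumerate(matrix):     # j: 0-based column index
    for i, entry in enumerate(samp):  # i: 0-based row index
        idx = j * nrow + i            # offset into the underlying column-major vector
        m[idx] = entry                # modifies the R vector in place, no copy of m

Each assignment then touches a single cell of the existing R vector instead of rebuilding the whole matrix.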

Note:

I think that the limitation to 8 Gb, caused by R indices being 32-bit integers if I remember correctly, was lifted a couple of years ago. You might also have less unlimited memory than you believe: the physical memory on a system does not necessarily mean that one can allocate all of it in one contiguous block.

lgautier
  • Ah, this is somewhat clarifying. I didn't realize you could assign to the index like that. I assume it would work for my purpose, but I just used the `robjects.r` functionality to create the function in native R code (similar to your answer to [this question](http://stackoverflow.com/questions/30754885/prepare-a-python-string-for-r-using-rpy2?lq=1)). The indexing above is likely a more elegant solution though, so I will mark it as the answer. Thanks for your responsiveness (and for creating/maintaining rpy2!). It is a huge boon for those of us who find R useful, but unwieldy at times. – Jared Andrews Jun 12 '16 at 19:32