My Goal
So I'm trying to call the normalize.quantiles function from the preprocessCore R package (R-3.2.1) from within a Python 3 script, using the rpy2 package (which exposes the function as normalize_quantiles), on an enormous matrix (10 GB+ files). I have virtually unlimited memory. When I run it through R itself with the following, I am able to complete the normalization and write the result out:
require(preprocessCore);
all <- data.matrix(read.table("data_table.txt", sep="\t", header=TRUE));
all[,6:57] = normalize.quantiles(all[,6:57]);  # normalize only the data columns
write.table(all, "QN_data_table.txt", sep="\t", row.names=FALSE);
I'm trying to build this into a Python script that also does other things, using the rpy2 package, but I'm having trouble with the way it builds matrices. An example is below:
# Assumes earlier in the script: import numpy as np,
# import rpy2.robjects as robjects, and
# preprocessCore = importr('preprocessCore') via rpy2.robjects.packages.importr.
matrix = sample_list  # My 2-D Python array containing the data.
# Flatten the data column by column into a single R numeric vector.
v = robjects.FloatVector([element for col in matrix for element in col])
# Reshape that vector into an R matrix, one column per sample.
m = robjects.r['matrix'](v, ncol=len(matrix), byrow=False)
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(Rnormalized_matrix)
return header, pos_list, norm_matrix
The Issue
This works fine for smaller files, but when I run it on my huge files, it dies with the error: rpy2.rinterface.RRuntimeError: Error: cannot allocate vector of size 9.7 Gb
I had read that the maximum size of a single vector in R is 8 Gb, which would explain why the above error is being thrown (9.7 Gb of doubles works out to roughly 1.2 billion elements at 8 bytes each). The rpy2 docs say:
"A Matrix is a special case of Array. As with arrays, one must remember that this is just a vector with dimension attributes (number of rows, number of columns)."
I sort of wondered how strictly rpy2 adheres to this, so I changed my code to initialize a matrix of the size I wanted and then iterate through and assign the values to its elements:
matrix = sample_list  # My 2-D Python array of data.
m_count = 1
# Preallocate an R matrix of zeros, then fill it element by element.
m = robjects.r['matrix'](0.0, ncol=len(matrix), nrow=len(matrix[0]))
for samp in matrix:
    i_count = 1
    for entry in samp:
        m.rx[i_count, m_count] = entry  # Assign the data to the element (.rx uses R's 1-based indexing).
        i_count += 1
    m_count += 1
print("Performing quantile normalization.")
Rnormalized_matrix = preprocessCore.normalize_quantiles(m)
norm_matrix = np.array(Rnormalized_matrix)
return header, pos_list, norm_matrix
Again, this works for smaller files but crashes with the same error as before.
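For what it's worth, the preallocated matrix is itself one flat R vector of nrow * ncol doubles, so I'd expect it to hit the same allocation limit; its footprint can be checked directly with R's object.size (a minimal sketch, with placeholder dimensions instead of my real ones):
import rpy2.robjects as robjects
m = robjects.r['matrix'](0.0, ncol=52, nrow=1000)  # placeholder dimensions
size_bytes = robjects.r['object.size'](m)[0]
print(size_bytes)  # roughly 52 * 1000 * 8 bytes, plus a small header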
So my question is: what is the underlying difference that lets R itself allocate and fill these huge matrices but causes issues through rpy2? Is there a different way I need to approach this? Should I just suck it up and do it in R? Or is there a way to circumvent the issue I'm having?