
I am working with large matrices, such as the MovieLens 20M dataset. I restructured the downloaded file so that it matches the dimensions mentioned on the page (138,000 by 27,000); the original file uses indices that span roughly 138,000 by 131,000, but contains many empty columns. Simply dropping those empty columns and re-indexing yields the desired dimensions.

Anyway, the snippet that converts the sparse CSV file to a dense format looks like this:

import pandas as pd
from scipy import sparse

# note that the file is not the one described in the link, but the smaller, re-indexed one
X = pd.read_csv("ml-20m-dense.dat", sep=",", header=None)
# columns 0, 1, 2 hold the row index, column index and value; build a COO matrix and densify it
mat = sparse.coo_matrix((X[2], (X[0], X[1]))).todense()

Now, the estimated size in memory should be close to 138000 * 27000 * 8 / (1024^3) ≈ 27.8 GB.
Yet, when I inspect the process with htop, it shows only around 7 GB of memory consumption, although approximately 32 GB of virtual memory are reserved.

At first I thought that this might be due to some "efficiency trick" by either the pandas reader or the scipy.sparse package to avoid blowing up memory consumption.
But even after I call my PCA function on it, the active memory consumption never rises to the expected amount. Note that mat.nbytes returns exactly the estimated size, so NumPy is at least aware of the data.
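
(For reference, a quick way to compare the two numbers from inside Python; this continues from the snippet above and assumes the psutil package is installed, since the original observation was made with htop:)

import os
import psutil  # assumed to be installed; htop shows essentially the same numbers

proc = psutil.Process(os.getpid())
print("logical size: %.1f GB" % (mat.nbytes / 1024**3))              # ~27.8 GB
print("resident set: %.1f GB" % (proc.memory_info().rss / 1024**3))  # only ~7 GB in my case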


(PCA code for reference:)

from fbpca import pca
result = pca(mat, k=3, raw=False, n_iter=3)

Note that, although fbpca makes use of a randomized algorithm and I am only computing the top three components, the code still performs a (single, but full) matrix multiplication of the input matrix with a (much smaller) random matrix. Essentially, it still has to access every element of the input matrix at least once; a rough sketch of that step follows below.
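
(Rough sketch of that projection step, purely for illustration and not fbpca's actual code:)

import numpy as np

# Multiply the full (138000 x 27000) input by a small (27000 x 3) Gaussian test
# matrix; the product has to read every element of mat at least once.
omega = np.random.randn(mat.shape[1], 3)
Y = np.asarray(mat) @ omega   # result is only (138000 x 3)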

This last remark also makes my case slightly different from posts I have found, like this one, since there the elements are never actually accessed.

dennlinger
  • 7GB + 32GB = 39GB, more than enough to hold the data. Why are you assuming the process isn't storing data in the 32GB? – user2699 Aug 08 '18 at 15:17
  • It is reserving virtual memory for the whole size, but never actually "directly using" it. IMO the active consumption should be at around 27 GB once I multiply the matrix, since that accesses every element. – dennlinger Aug 08 '18 at 16:01
  • I think you're misunderstanding what virtual memory is. – user2699 Aug 08 '18 at 17:38
  • Would you care to elaborate on that? I mean, it should still show as *active consumption*, when I multiply the matrix, right? – dennlinger Aug 08 '18 at 18:26
  • This might be a good place to start reading about it: https://serverfault.com/questions/138427/top-what-does-virtual-memory-size-mean-linux-ubuntu. Virtual doesn't mean memory that the process isn't using, it just means memory assigned to the process that is not in physical ram at the moment. Of course to perform a matrix multiplication that has to be loaded into ram and then to the CPU, but numpy multiplies only a few elements at a time. The operating system knows enough not to load the entire array into ram, only the parts currently being operated on. – user2699 Aug 08 '18 at 19:41
  • If that is indeed the case (something to read up on *that* would be more helpful, probably, or more relevant), I would still like to know why NumPy doesn't keep them in memory, or "what happens to the lost bits". I am not arguing against virtual memory, but for a full pass through the data, I don't see why it would not be loaded into memory (and then stay there)... Think of the question as "why is it using much less "active" memory than it should" – dennlinger Aug 08 '18 at 19:45
  • I'm sorry I don't have any references, I remember these things were part of an introductory operating systems course I took some years ago. To rephrase my previous answer to your question, it is using memory differently than you think it should because the operating system was designed that way by people who have spent a significant amount of time determining the best way to do it. – user2699 Aug 08 '18 at 20:01

1 Answer


I think your problem lies in the todense() call, which uses np.asmatrix(self.toarray(order=order, out=out)) internally. toarray creates its output with np.zeros. (See toarray, _process_toarray_args)

So your question can be reduced to: Why doesn't np.zeros allocate enough memory?

The answer is probably lazy-initialization and zero pages:

Why does numpy.zeros takes up little space
Linux kernel: Role of zero page allocation at paging_init time

So all zero regions in your matrix are actually backed by the same physical zero page, and only writing to those entries will force the OS to allocate the physical memory behind them.
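
A minimal sketch that demonstrates this (it assumes psutil is available to read the resident set size, and uses a smaller array so it fits comfortably in RAM):

import os
import numpy as np
import psutil  # assumed to be installed, only used to read the resident set size

proc = psutil.Process(os.getpid())

z = np.zeros((10000, 10000))              # ~0.75 GB logical size
print(proc.memory_info().rss / 1024**3)   # resident memory barely grows: all pages still map the zero page

z[:] = 1.0                                # writing touches every page ...
print(proc.memory_info().rss / 1024**3)   # ... and resident memory now grows by ~0.75 GB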

Domso
  • So what you're saying is that the density of the sparse matrix (i.e. number of non-zero elements) should have an influence on the initially consumed memory... Very interesting, I will try to test this with a slightly different variation of the matrix. – dennlinger Aug 09 '18 at 10:20
  • And the question still remains why it would not allocate that memory once I do a pass over it for the matrix multiplication. – dennlinger Aug 09 '18 at 10:21
  • @dennlinger not just the number of zeros, but the number of blocks/pages containing only zeros. Meaning long contiguous sequences of zeros – Domso Aug 09 '18 at 10:35
  • @dennlinger do you modify the values while iterating over the data? If you only read, the zero-pages should remain. You could try to overwrite the whole matrix with random non-zero values and then check the RAM consumption again – Domso Aug 09 '18 at 10:42
  • Indeed, the fbpca computation seems to create local copies of the rows/columns, but never overwrites the original matrix. Thus, it seems the zero pages still hold, which is what keeps the memory consumption so low. – dennlinger Aug 20 '18 at 09:04