
So I have these ginormous matrices X and Y. X and Y both have 100 million rows, and X has 10 columns. I'm trying to implement linear regression with these matrices, and I need the quantity (X^T*X)^-1 * X^T * Y. How can I compute this as space-efficiently as possible?

Right now I have

X = readMatrix("fileX.txt")   # readMatrix is my own loader for the text files
Y = readMatrix("fileY.txt")
beta = (X.getT() * X).getI() * X.getT() * Y   # numpy.matrix: getT() is transpose, getI() is inverse

How many matrices are being stored in memory here? Are more than two matrices being stored at once? Is there a better way to do it?

I have about 1.5 GB of memory for this project. I can probably stretch that to 2 or 2.5 GB if I close every other program. Ideally the process would also run quickly, but the memory bound is the stricter constraint.

The other approach I've tried is saving the intermediate steps of the calculation as text files and reloading them after every step. But that is very slow.

i love stackoverflow
    [Various approaches for working with very large matrices with numpy](http://stackoverflow.com/questions/1053928/python-numpy-very-large-matrices) – Francis Avila Oct 30 '12 at 22:07
  • The `out` argument to functions like `np.dot` (and most numpy functions) lets you specify an output array for the result. That way no hidden temporaries are created and you stay aware of your memory use. Do not use text files (they are just slow and bloated); if you go that way, you should probably use `memmap`s. – seberg Oct 30 '12 at 23:44
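A minimal sketch of the `out` idea from the comment above (the shapes here are stand-ins, not the real 100e6-row data):

import numpy as np

X = np.random.rand(1000000, 10)   # stand-in for one block of the real X
XtX = np.empty((10, 10))          # preallocate the small result once
np.dot(X.T, X, out=XtX)           # writes into XtX; no hidden temporary is allocated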

3 Answers


The size of X is 100e6 x 10 and the size of Y is 100e6 x 1,

so the final size of (X^T*X)^-1 * X^T * Y is 10 x 1

You can calculate it with the following steps:

  1. calculate a = X^T*X -> 10 x 10
  2. calculate b = X^T*Y -> 10 x 1
  3. calculate a^-1 * b

The matrices in step 3 are very small, so you just need some intermediate steps to calculate 1 & 2.

For example, you can read column 0 of X together with Y, and compute that entry of b with numpy.dot(X0, Y).

For float64 dtype, X0 and Y together take about 1600 MB. If that can't fit in memory, you can call numpy.dot twice, on the first halves and the second halves of X0 & Y separately, and add the results.

So to calculate X^T*Y you need to call numpy.dot 20 times, and to calculate X^T*X you need to call it 200 times.
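A minimal sketch of this scheme, assuming each column has first been converted to its own raw binary float64 file (the colX_*.dat / colY.dat names and that conversion step are hypothetical); np.memmap then only pages in the slices each numpy.dot actually touches, and np.linalg.solve handles step 3 without forming the inverse:

import numpy as np

n, p = 100000000, 10
half = n // 2

# hypothetical per-column binary files: colX_0.dat ... colX_9.dat and colY.dat
cols = [np.memmap("colX_%d.dat" % j, dtype="float64", mode="r", shape=(n,))
        for j in range(p)]
y = np.memmap("colY.dat", dtype="float64", mode="r", shape=(n,))

def halved_dot(u, v):
    # two numpy.dot calls, one per half, so at most ~800 MB of
    # column data needs to be resident at a time
    return np.dot(u[:half], v[:half]) + np.dot(u[half:], v[half:])

a = np.array([[halved_dot(cols[i], cols[j]) for j in range(p)]
              for i in range(p)])                # X^T*X: 10 x 10, 200 dot calls
b = np.array([halved_dot(c, y) for c in cols])   # X^T*Y: 10 x 1, 20 dot calls

beta = np.linalg.solve(a, b)   # step 3: a^-1 * b, without an explicit inverse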

HYRY

A neat property of ordinary least squares regression is that if you have two datasets X1, Y1 and X2, Y2 and you have already computed all of

  • X1' * X1
  • X1' * Y1
  • X2' * X2
  • X2' * Y2

And you now want to do the regression on the combined dataset X = [X1; X2] and Y = [Y1; Y2], you don't actually have to recompute very much. The relationships

  • X' * X = X1' * X1 + X2' * X2
  • X' * Y = X1' * Y1 + X2' * Y2

hold, so with these computed you just calculate

  • beta = inv(X' * X) * (X' * Y)

and you're done. This leads to a simple algorithm for OLS on very large datasets:

  • Load in part of the dataset (say, the first million rows) and compute X' * X and X' * Y (which are quite small matrices) and store them.
  • Keep doing this for the next million rows, until you have processed the whole dataset.
  • Add together all of the X' * Xs and X' * Ys that you have stored.
  • Compute beta = (X' * X) \ (X' * Y).

That is not appreciably slower than loading in the entire dataset at once, and it uses far less memory.
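A sketch of that loop in numpy, where read_chunks is a hypothetical generator yielding (X_block, Y_block) pairs of a million rows at a time, so only one block is ever in memory:

import numpy as np

p = 10
chunk = 1000000   # rows per block; tune to the memory budget

XtX = np.zeros((p, p))
XtY = np.zeros(p)

# read_chunks is hypothetical: X_block is chunk x 10,
# Y_block is the matching length-chunk vector
for Xb, Yb in read_chunks("fileX.txt", "fileY.txt", chunk):
    XtX += np.dot(Xb.T, Xb)   # accumulate X' * X
    XtY += np.dot(Xb.T, Yb)   # accumulate X' * Y

beta = np.linalg.solve(XtX, XtY)   # the final solve (see the note below)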

Final note: you should never compute beta by first computing (X' * X) and finding its inverse (for two reasons - 1. it is slow, and 2. it is prone to numerical error).

Instead, you should solve the linear system -

  • (X' * X) * beta = X' * Y

In MATLAB this is a simple one-liner

beta = (X' * X) \ (X' * Y);

and I expect that numpy has a similar way of solving linear systems without needing to invert a matrix.
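It does: numpy.linalg.solve plays the role of MATLAB's backslash for a square system. With XtX and XtY accumulated as in the sketch above:

import numpy as np

beta = np.linalg.solve(XtX, XtY)   # solves (X' * X) * beta = X' * Y directly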

Chris Taylor
  • I have turned this algorithm into a Python package. It can be found here: https://github.com/cstich/PyBols – cstich Jul 19 '21 at 08:47

RAM's pretty cheap - you should consider investing. A system with 24 Gig of RAM doesn't necessarily cost an arm and a leg anymore - one of Dell's lower-end servers can pack in that much.

If the matrices are sparse (lots of zeros), use a sparse matrix class to save a lot of RAM.
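For instance, with scipy (a sketch; the thresholded random matrix is just a stand-in for genuinely sparse data):

import numpy as np
from scipy import sparse

dense = np.random.rand(1000, 10)
dense[dense < 0.9] = 0.0          # stand-in: roughly 90% zeros
Xs = sparse.csr_matrix(dense)     # CSR stores only the nonzero entries
XtX = (Xs.T * Xs).toarray()       # sparse matmul, back to a dense 10 x 10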

If the matrices aren't sparse, you'll either want more RAM (or at least more Virtual Memory), or to do your matrix operations using disk files.

Disk files are of course an order of magnitude slower than RAM, and thrashing your virtual memory system could actually be worse than that, depending on your access patterns.

user1277476