
I have a problem computing a reduction over a function of a network represented by a large (200000x200000) matrix, generated as the distance matrix between pairs of points.

Minimal example, where the input X is a 200000x2 NumPy array of Cartesian coordinates:

import tensorflow as tf

x = tf.constant(X[:,0], shape=[X.shape[0],1])   # x coordinates as a column vector
y = tf.constant(X[:,1], shape=[X.shape[0],1])   # y coordinates as a column vector
dx = x - tf.transpose(x)                        # pairwise x differences, [N, N]
dy = y - tf.transpose(y)                        # pairwise y differences, [N, N]
D = tf.sqrt(dx*dx + dy*dy)                      # pairwise distance matrix, [N, N]
M = 0.1 * 5.0 / tf.pow(4.0 + D, 1.5)            # elementwise function of the distances
res = tf.reduce_sum(M)

Running on the CPU, the memory (16GB on my MBP) is quickly oversubscribed and the system grinds to a halt. Presumably tf is trying to store the whole of D (and M?) in memory.

If I were writing this in C/C++, I would most likely loop over the matrix rows, summing each row as I go and never storing the whole matrix. Ditto on the GPU: I'd subdivide the (virtual) matrix and perform the reduction in chunks.
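
For concreteness, the row-by-row loop I have in mind looks something like this in plain NumPy (not TF; purely illustrative):

import numpy as np

# Row-by-row reduction: only one row of D (and M) exists at any time.
res = 0.0
for i in range(X.shape[0]):
    d = np.sqrt((X[i,0] - X[:,0])**2 + (X[i,1] - X[:,1])**2)   # one row of D
    res += np.sum(0.1 * 5.0 / (4.0 + d)**1.5)                  # one row of M, summed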

Is there a trick to getting tf to follow a more chunk-wise behaviour, economising on memory?

Cheers,

Chris

EDIT:

An alternative approach which copes with the memory issue is to use tf.map_fn:

xy = tf.constant(X)  # [N, 2] coordinates, so the full 2D distances are used
rowsums = tf.map_fn(
    lambda p: tf.reduce_sum(tf.sqrt(tf.reduce_sum(tf.pow(p - xy, 2), 1))), xy)
res = tf.reduce_sum(rowsums)

Thus only the row sums are stored as a tensor, and not the full distance matrix. However, while this approach works well on the CPU, it grinds to a halt on the GPU.

1 Answer


What's really needed here (but not yet implemented) is cwise (elementwise) fusion. What happens right now is that 2*sqrt(a+b) will allocate a new tensor for a+b, then a new tensor for sqrt, and then another one for 2*sqrt. PS: you can dig into where the memory is going by examining the memory allocation messages (verbose logging needs to be enabled).
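
One way to turn on that verbose logging, assuming the standard TF_CPP_MIN_VLOG_LEVEL environment variable is what's meant (the exact level needed for the allocation messages is a guess):

import os

# Assumption: raising the C++ VLOG level surfaces allocation messages;
# higher values are noisier. This must be set before TensorFlow is imported.
os.environ['TF_CPP_MIN_VLOG_LEVEL'] = '1'

import tensorflow as tf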

You could make things more memory-efficient by using variables and assign_add to update the result incrementally without creating many intermediate tensors. There's an alternative formula for computing "all pairwise distances" here which may be easier to convert to this form.
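
A minimal sketch of the variables-plus-assign_add idea (my own illustration, assuming a TF 1.x-style session; the fed values stand in for per-chunk reductions):

import tensorflow as tf

# Hypothetical stand-in for the reduction over one chunk of the matrix;
# in practice this would be the summed M values for a block of rows.
partial_sum = tf.placeholder(tf.float64)

total = tf.Variable(0.0, dtype=tf.float64)
update = tf.assign_add(total, partial_sum)   # add one chunk's sum to the running total

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for value in [1.0, 2.0, 3.0]:            # toy chunk sums, for illustration only
        sess.run(update, feed_dict={partial_sum: value})
    print(sess.run(total))                   # 6.0

The "alternative formula" is presumably the usual expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, which builds all squared pairwise distances from one matrix product plus broadcast row and column norms.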

Yaroslav Bulatov
  • Hi Yaroslav, could you be a little more specific about how you estimate memory usage? I have that D maximally requires 8*200000^2 ~ 300GB storage, and about half that (ignoring the diagonal) as a triangular matrix. Thanks! – Chris Jewell Jun 02 '16 at 09:06
  • Doh, I was off by 1000. Yes, that seems infeasible to store in memory. You could treat your data the same way that TF treats datasets during training, only loading them in chunks. You can use two `SliceInputProducer` + `batch` sets to produce chunks, and a doubly nested loop to iterate over them – Yaroslav Bulatov Jun 02 '16 at 15:13
  • To be more detailed, you could use `SliceInputProducer`+`batch`+`assign` to save a subset of points into variable `subset1` on each run call (run1), you then also have a separate `SliceInputProducer`+`batch`+`assign` to save subset of points into variable `subset2` on each run call (run2). You do run1 in outer loop and run2 in the inner loop. Finally you have some logic which takes `subset1` and `subset2` variables and compute all pairwise distances between them and adds to your total, that's your third run command (run3) – Yaroslav Bulatov Jun 02 '16 at 16:47
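
A rough sketch of the run1/run2/run3 pattern described in the last comment (my own reconstruction, not Yaroslav's code; the block size is illustrative, N is assumed divisible by it, and X is assumed to be float64, the NumPy default):

import numpy as np
import tensorflow as tf

block = 1000
N = X.shape[0]                                   # assumed divisible by block

# run1: load the next block of points into the variable `subset1`
batch1 = tf.train.batch(tf.train.slice_input_producer([X], shuffle=False),
                        batch_size=block)[0]     # [block, 2]
subset1 = tf.Variable(np.zeros((block, 2)))
run1 = tf.assign(subset1, batch1)

# run2: the same thing for `subset2`
batch2 = tf.train.batch(tf.train.slice_input_producer([X], shuffle=False),
                        batch_size=block)[0]
subset2 = tf.Variable(np.zeros((block, 2)))
run2 = tf.assign(subset2, batch2)

# run3: pairwise distances between the two current blocks, added to `total`
d = tf.expand_dims(subset1, 1) - tf.expand_dims(subset2, 0)   # [block, block, 2]
D_blk = tf.sqrt(tf.reduce_sum(tf.square(d), 2))
M_blk = 0.1 * 5.0 / tf.pow(4.0 + D_blk, 1.5)
total = tf.Variable(0.0, dtype=tf.float64)
run3 = tf.assign_add(total, tf.reduce_sum(M_blk))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(N // block):                  # outer loop: blocks of rows
        sess.run(run1)
        for _ in range(N // block):              # inner loop: blocks of columns
            sess.run(run2)
            sess.run(run3)
    print(sess.run(total))
    coord.request_stop()
    coord.join(threads)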