
I have a large sparse matrix containing a histogram which I would like to plot as a heatmap. Normally I would simply plot the full matrix (h) as follows:

import matplotlib.pyplot as plt
plt.imshow(h.T, interpolation="nearest", origin="lower")
plt.colorbar()
plt.savefig("corr.eps")

In this case, however, the full matrix would have dimensions of 189,940x189,940, which is too large for me to hold in memory. I have found posts on plotting the sparsity pattern (e.g. python matplotlib plot sparse matrix pattern) but nothing yet on how to plot the heatmap without first converting the matrix to a dense one. Is it possible to do so? (Or is there some other way of plotting it without running out of RAM?) My sparse matrix is currently a lil_matrix (scipy.sparse.lil_matrix).

P-M
  • Never tried it myself, but have you looked into `datashader`. Might be useful. – reptilicus Feb 13 '17 at 18:07
  • Have you considered simply plotting the points individually as a scatter or rectangle collection? – ImportanceOfBeingErnest Feb 13 '17 at 18:08
  • I would normally do that but in this case each cell contains a count, so I don't care about whether a cell is populated but about what value it holds. I'm not sure how I would visualise that using scatter. I have not looked at `datashader` yet but shall have a look. – P-M Feb 13 '17 at 18:12
  • In a heatmap color is used to convey the information about the data value. Usually this is done by using a colormap. So setting the color of the scatter or collection items according to a colormap might be an option. – ImportanceOfBeingErnest Feb 13 '17 at 18:19
  • I see what you mean. Do you know how this might be implemented in practice? – P-M Feb 13 '17 at 18:24
  • So you have a large sparse matrix, with say, less than 1% of the values nonzero. And you plot a scatter plot with the dot colors dependent on their value. What would that look like? A widely scattered set of very small colored dots, more like the night sky than a colored surface? – hpaulj Feb 13 '17 at 18:55
  • The problem even if the dots are dense enough is that colours don't add up. For example imagine you have a cluster of 20 very close dots each indicating 5% of max value, and somewhere else a single pixel at max value. In a proper heatmap these two should look almost the same, because they represent the same total in a small area. But in your coloured scatter plot the first will look a very intense 5% and the latter a faint 100%. In short this whole coloured scatter idea doesn't really work, does it? – Paul Panzer Feb 13 '17 at 19:49

2 Answers


One idea would be to downsample using sparse operations.

import numpy as np
from scipy import sparse

data = data.tocsc()        # sparse operations are more efficient on csc
N, M = data.shape
s, t = 400, 400            # decimation factors for y and x directions
# T bins columns: T[i, j] = 1 if i // t == j
T = sparse.csc_matrix((np.ones((M,)), np.arange(M), np.r_[np.arange(0, M, t), M]),
                      shape=(M, (M - 1) // t + 1))
# S bins rows: S[i, j] = 1 if j // s == i
S = sparse.csr_matrix((np.ones((N,)), np.arange(N), np.r_[np.arange(0, N, s), N]),
                      shape=((N - 1) // s + 1, N))
result = S @ data @ T      # downsample by binning into s x t rectangles
result = result.todense()  # small enough now to densify for plotting

This code snippet implements a simple binning, but it could be refined to incorporate more sophisticated filters. The binning matrices are just binned identity matrices, for example S_ij = 1 if j // s == i and 0 otherwise.
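To make the structure of these matrices concrete, here is a small standalone sketch (my own toy example, not part of the answer above) that builds the row-binning matrix S for N = 10 and a decimation factor s = 4 and prints it as a dense array:

import numpy as np
from scipy import sparse

N, s = 10, 4
S = sparse.csr_matrix((np.ones((N,)), np.arange(N), np.r_[np.arange(0, N, s), N]),
                      shape=((N - 1) // s + 1, N))
print(S.toarray())
# [[1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 1. 1. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]]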

Some more explanation: since the original matrix is very large, there is scope to downsample it without any visually noticeable difference in the output.

The question is how to downsample without creating a dense representation first. One possible answer is to express the binning in terms of matrix multiplication and then use sparse matrix multiplication.

So, if you multiply your original data from the right with a binning matrix T, the columns of T correspond to the column bins; in particular, the number of columns of T determines how many pixels the downsampled data will have in the x direction. Each column of T determines what goes into the corresponding bin and what does not. In the example I set the elements corresponding to a run of adjacent columns (of the original matrix) to 1 and the rest to 0. This takes those columns, sums across them and puts the sum into the result matrix; in other words, it bins these columns together.

Multiplying from the left works in exactly the same way, only it affects rows, not columns.
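As a quick sanity check (a toy example of mine, not data from the question), the triple product S @ A @ T does exactly the same thing as summing each s x t block of the dense matrix:

import numpy as np
from scipy import sparse

N = M = 6
s = t = 3
A = sparse.random(N, M, density=0.3, format="csc", random_state=0)

S = sparse.csr_matrix((np.ones((N,)), np.arange(N), np.r_[np.arange(0, N, s), N]),
                      shape=((N - 1) // s + 1, N))
T = sparse.csc_matrix((np.ones((M,)), np.arange(M), np.r_[np.arange(0, M, t), M]),
                      shape=(M, (M - 1) // t + 1))

binned = (S @ A @ T).toarray()                             # 2 x 2, one entry per 3 x 3 block
blocks = A.toarray().reshape(2, 3, 2, 3).sum(axis=(1, 3))  # the same sums computed densely
print(np.allclose(binned, blocks))                         # True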

If you feel that binning is too crude, you can replace the simple zero-one scheme with, for example, a smooth kernel; just make sure that the resulting matrix remains sparse. Setting up such a matrix requires a bit more effort, but it is not difficult. Since you are already using a sparse matrix for your data, I assume you are familiar with how to construct one.
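For instance, a small first refinement (my own sketch, not code from the answer) is to normalize each column of T so that the result contains block means rather than block sums; the same triplet-based construction extends to overlapping, smoothly weighted columns as long as most entries stay zero:

import numpy as np
from scipy import sparse

M, t = 10, 4
col_bin = np.arange(M) // t                    # bin index of each original column
bin_size = np.bincount(col_bin).astype(float)  # number of columns falling in each bin
T_mean = sparse.csc_matrix((1.0 / bin_size[col_bin], (np.arange(M), col_bin)),
                           shape=(M, col_bin.max() + 1))
# data @ T_mean now averages adjacent columns instead of summing them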

Paul Panzer
  • Thank you for the suggestion. Could you explain a bit more how the sampling works here/how I change what gets sampled? I fear I'm not quite following it yet. – P-M Feb 14 '17 at 11:53
  • I am using python2.7 (sorry that wasn't apparent from my post) so the `@` operator doesn't exist. According to this post (https://stackoverflow.com/questions/34142485/difference-between-numpy-dot-and-python-3-5-matrix-multiplication ) it is the equivalent of `np.matmul` as opposed to `np.dot`. Is that what you intended? – P-M Feb 15 '17 at 15:13
  • @P-M ugh, nasty stuff you're linking there. The good news is that as long as you are working with 2d arrays, `np.dot(a, b)`, `np.matmul(a, b)`, `a @ b` and `a.dot(b)` all produce the same result, so take your pick. – Paul Panzer Feb 15 '17 at 15:24
  • Yup, arrived at that conclusion too. I realised though that using `np.matmul` doesn't work on sparse arrays as I need to use the appropriate scipy routine (`TypeError: Object arrays are not currently supported` otherwise) so I am using `result = S.dot(data)` followed by `result = result.dot(T)`. I am now trying to find a decent value for `s` and `t` to give me a good image. – P-M Feb 15 '17 at 15:27

Paul's approach is what matspy uses to make spy plots. Visually it looks like this:

[figure: matspy triple product]

Matspy only cares about the sparsity pattern and not the values, but we can use its internal helper method that creates those left and right matrices:

from matspy.adapters.scipy_impl import generate_spy_triple_product_coo

data  # a scipy sparse matrix
binned_shape = tuple(int(x / 3) for x in data.shape)  # example: shrink to a third of the original size

left, right = generate_spy_triple_product_coo(data.shape, binned_shape)

result = left @ data @ right
result = result.todense()
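To close the loop with the question (a sketch assuming `result` is the small, dense, binned matrix produced by either answer), the original plotting code can then be used unchanged:

import numpy as np
import matplotlib.pyplot as plt

plt.imshow(np.asarray(result).T, interpolation="nearest", origin="lower")
plt.colorbar()
plt.savefig("corr.eps")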
Adam