1

I have a large .tiff file (4.4 GB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I was thinking it would be better to import the file as a sparse matrix to save RAM. But when I open it as a np.array and then convert it to a sparse matrix with csr_matrix(), my kernel crashes. See the code below.

from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix

ds = gdal.Open("file.tif")
band =  ds.GetRasterBand(1)
array = np.array(band.ReadAsArray())
csr_matrix(array)

Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)

  • 1
    Honestly, 16% density is probably not worth it - a CSR matrix needs to keep two values for each non-zero. You're saving about ~65% of the memory of the full array, but losing a lot of the convenience that comes with a memory-contiguous dense array. If the values are all 0-255, you could convert the dense array to `np.uint8` (which is 12.5% of the size of the default `np.int64`) – CJR Mar 18 '22 at 13:51
  • AFAIK there are no sparse matrix formats in Scipy that would result in a significantly smaller memory footprint here. You can find the list here: https://docs.scipy.org/doc/scipy/reference/sparse.html . The only solutions I see are: working in [mapped memory](https://en.wikipedia.org/wiki/Memory-mapped_file) or operating chunk by chunk (a chunked sketch follows below). The former is simpler but slower than the latter. – Jérôme Richard Mar 19 '22 at 17:38
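
A minimal sketch of the chunk-by-chunk idea mentioned in the comment above. The block size, the use of a nodata value to mark the invalid cells, and the mean-of-valid-values reduction are all illustrative assumptions, not part of the original post:

from osgeo import gdal
import numpy as np

ds = gdal.Open("file.tif")
band = ds.GetRasterBand(1)
nodata = band.GetNoDataValue()  # assumed to mark the invalid ~84% of cells
xsize, ysize = ds.RasterXSize, ds.RasterYSize

block = 4096  # rows/columns per chunk; tune to the available RAM
total = 0.0
count = 0
for yoff in range(0, ysize, block):
    rows = min(block, ysize - yoff)
    for xoff in range(0, xsize, block):
        cols = min(block, xsize - xoff)
        chunk = band.ReadAsArray(xoff, yoff, cols, rows)  # only this window is held in memory
        valid = chunk != nodata
        total += chunk[valid].sum()
        count += valid.sum()

print("mean of valid values:", total / count)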

2 Answers

1

Can you tell where the crash occurs?

band =  ds.GetRasterBand(1)
temp = band.ReadAsArray()
array = np.array(temp)    # if temp is already an array, you don't need this
csr_matrix(array)

If the array is 4.4 GB with shape (79530, 54980):

In [62]: (79530 * 54980) / 1e9
Out[62]: 4.3725594    # 4.4 GB makes sense for 1 byte/element
In [63]: (79530 * 54980) * 0.16        # 16% density
Out[63]: 699609504.0                # number of nonzero values

Creating a csr matrix requires doing np.nonzero(array) to get the indices. That produces 2 index arrays of about 0.7e9 elements each; at 8 bytes per index that is roughly 5.6 GB per array. The coo format needs those 2 index arrays plus the ~0.7 GB of nonzero values - about 12 GB in total. Converted to csr, the row index array is reduced to an indptr of 79531 elements, which still leaves roughly 6.3 GB for the column indices and data.
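
Spelling that estimate out (back-of-the-envelope only, assuming 8-byte indices and 1-byte values):

nnz = int(79530 * 54980 * 0.16)               # ~0.7e9 nonzero values
dense_bytes = 79530 * 54980 * 1               # ~4.4 GB at 1 byte/element

coo_bytes = nnz * (8 + 8 + 1)                 # row indices + col indices + uint8 data -> ~11.9 GB
csr_bytes = nnz * (8 + 1) + (79530 + 1) * 8   # col indices + data + indptr            -> ~6.3 GB

print(dense_bytes / 1e9, coo_bytes / 1e9, csr_bytes / 1e9)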

So at 16% density, the sparse format is, at its best, still larger than the dense version.

Memory error when converting matrix to sparse matrix, specified dtype is invalid

is a recent case of a memory error that occurred in the nonzero step.

hpaulj
  • The index arrays are not 0.7 Gb, nor 0.7 GB each: you have to multiply by 4 or 8, since the indices are encoded as int32 (the default on Windows) or int64 (the default on Linux). So it is not 2.1 GB but 6.3 GB, which is too big for the OP and causes a crash due to the lack of memory. – Jérôme Richard Mar 19 '22 at 17:28
0

Assuming you know the size of your matrix, you can create an empty sparse matrix and then set only the valid values one by one.

from osgeo import gdal
import numpy as np
from scipy.sparse import csr_matrix

ds = gdal.Open("file.tif")
band =  ds.GetRasterBand(1)
matrix_size = (1000, 1000) # set your size
matrix = csr_matrix(matrix_size)

# for each valid value
matrix[i, j] = your_value

Edit 1

If you don't know the size of your matrix, you can check it like this:

from osgeo import gdal

ds = gdal.Open("file.tif")
width  = ds.GetRasterXSize()   # number of columns
height = ds.GetRasterYSize()   # number of rows
matrix_size = (height, width)  # matches the (rows, cols) shape of ReadAsArray()

Edit 2

I measured the matrix types suggested in the comments (filled completely). This is how I measured memory usage.

size 500x500

| matrix | empty size (bytes) | full size (bytes) | filling time |
| --- | --- | --- | --- |
| csr_matrix | 2856 | 2992 | 477.67 s |
| dok_matrix | 726 | 35807578 | 3.15 s |
| lil_matrix | 8840 | 8840 | 0.54 s |

size 1000x1000

| matrix | empty size (bytes) | full size (bytes) | filling time |
| --- | --- | --- | --- |
| csr_matrix | 4856 | 4992 | 7164.94 s |
| dok_matrix | 726 | 150578858 | 12.81 s |
| lil_matrix | 16840 | 16840 | 2.19 s |

Probably the best solution would be to use lil_matrix.
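
A minimal sketch of that approach (the sizes and values are placeholders; note the comments below, which argue that at 16% density this still uses more memory than a dense uint8 array):

from scipy.sparse import lil_matrix

height, width = 1000, 1000                 # use the real raster size from Edit 1
matrix = lil_matrix((height, width), dtype="float32")

# stand-in for "for each valid value": set a few cells
for i, j, value in [(0, 0, 1.5), (10, 20, 3.0), (999, 999, 7.0)]:
    matrix[i, j] = value

csr = matrix.tocsr()                        # convert once at the end for fast arithmetic
print(csr.nnz)                              # -> 3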

K.Mat
  • Thanks but I don't know the size of my sparse matrix – justin fiedler Mar 18 '22 at 11:00
  • @justinfiedler you should be able to check your data size. Check my edit – K.Mat Mar 18 '22 at 11:29
  • 2
    Editing a CSR sparse matrix is *insanely slow* because it is not meant for this usage (especially at this low sparsity). I think it will also take more RAM, because Scipy resizes the nnz arrays for each item (twice the CSR memory usage is needed) and uses the default datatype. Filling a 1000x1000 CSR sparse matrix took 3m30s on my machine; I expect this to take several dozen hours for the OP's input. LIL matrices are better for this, but their memory usage is higher, so in the end it will take more space and be much slower than a dense matrix... – Jérôme Richard Mar 18 '22 at 11:53
  • 2
    Every time you add a value to a CSR matrix, the entire existing matrix is copied and all the pointers are recomputed. You've taken an `O(N)` problem and made it `O(2^N)`. – CJR Mar 18 '22 at 13:38
  • 2
    @CJR Only `O(N^2)` which is already big ;) . – Jérôme Richard Mar 18 '22 at 17:51
  • To be clear, the `lil_matrix` might be fast compared to iteratively building a `csr_matrix`, but will use more memory than the numpy array would (at this sparsity), while being slower than just using that numpy array. You've just incorrectly measured the size of the nested list objects (which should be obvious from the fact that adding values to it doesn't change the memory usage). – CJR Mar 19 '22 at 15:11
  • How do you do this `# for each valid value` loop? `csr_matrix` uses `np.nonzero` to find the nonzero values. But if you already have the `i, j, v` arrays for all nonzero values, why not just make the matrix using the `coo`-style inputs, `csr_matrix((v, (i, j)))` (a sketch follows below)? There's no need to iterate, setting one value at a time. – hpaulj Mar 19 '22 at 17:15
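
A rough sketch of the coo-style construction suggested in that comment, using toy `i, j, v` arrays (for the real raster you would still need enough memory to hold those arrays, as estimated in the first answer):

import numpy as np
from scipy.sparse import csr_matrix

# Toy index/value arrays standing in for the valid cells of the raster.
i = np.array([0, 2, 2, 4])
j = np.array([1, 0, 3, 4])
v = np.array([10.0, 20.0, 30.0, 40.0])

sparse = csr_matrix((v, (i, j)), shape=(5, 5))   # one constructor call, no per-element assignment
print(sparse.toarray())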