
The main part of my issue was answered here.

However, my problem lies in the fact that I have a large amount of data: two vectors of 265k (265 000) entries each. I tried converting these arrays into sparse scipy matrices, but the subtraction doesn't work that way because scipy matrices don't support numpy's broadcasting.

from scipy import sparse

array_one = sparse.csr_matrix(df.id.values)           # shape (1, N)
array_two = sparse.csr_matrix(df.id.values[:, None])  # shape (N, 1)
array_one - array_two  # throws MemoryError
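
For reference, this is a minimal sketch of what the dense broadcasting approach from the linked question would look like (assuming df.id holds the 265 180 float values); it is exactly this intermediate NxN array that cannot fit in memory:

values = df.id.values  # N = 265 180 entries
# Nx1 - 1xN broadcasting materialises a dense NxN array:
# 265 180**2 float64 entries is over 500 GB, hence the MemoryError.
diff = values[:, None] - values[None, :]
coincidence = (diff == 0)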

Currently, I am looping over 265k² entries, and while it's working its magic I'd like to ask for a more efficient solution for the future.

I'm doing this:

rows, cols, data_items = [], [], []
for j in range(265180):
    for i in range(265180):
        if arr1.data[j] == arr2.data[i]:
            rows.append(j)
            cols.append(i)
            data_items.append(1)
# coo_matrix expects the triplets as (data, (row, col))
mat = sparse.coo_matrix((data_items, (rows, cols)), shape=(265180, 265180))
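
For reference, scipy's coo_matrix constructor takes the coordinate triplets as a single (data, (row, col)) tuple; a minimal self-contained example with made-up toy values:

from scipy import sparse

rows = [0, 1, 2]
cols = [2, 0, 1]
data_items = [1, 1, 1]
# builds a 3x3 sparse matrix with ones at (0, 2), (1, 0) and (2, 1)
mat = sparse.coo_matrix((data_items, (rows, cols)), shape=(3, 3))
print(mat.toarray())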

Is there any way to convert the result of an Nx1 - 1xN numpy broadcast into an NxN sparse matrix when N is large enough that the calculated dense matrix doesn't fit into memory?

bmakan
  • Is the number of distinct values in the input vectors limited in any way? (For example, if the data type of the input vectors is 8 bit, then there are at most 256 distinct values in each array.) – Warren Weckesser Nov 24 '17 at 16:31
  • They are floats. But I suppose I could round them to 2 decimals or something similar. That would be 101 distinct values. Are you suggesting I convert the type to byte? – bmakan Nov 25 '17 at 20:51
  • No, I was just thinking about methods that might take advantage of a small number of distinct values. By the way, you might find some good ideas in this question and its answers: https://stackoverflow.com/questions/47475611/fastest-coincidence-matrix (ping @Divakar). – Warren Weckesser Nov 25 '17 at 21:14
