
The main part of my issue was answered here.

However, my problem lies in the fact that I have a large amount of data: two vectors of 265k (265 000) entries each. I tried converting these arrays into sparse scipy matrices, but the subtraction doesn't work that way because scipy matrices don't support numpy's broadcasting.

from scipy import sparse

array_one = sparse.csr_matrix(df.id.values)           # shape (1, N)
array_two = sparse.csr_matrix(df.id.values[:, None])  # shape (N, 1)
array_one - array_two  # throws MemoryError
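
For reference, this is a minimal sketch of what the dense broadcasting approach from the linked question would look like (assuming df.id holds the 265 180 float values); it is exactly this intermediate NxN array that cannot fit in memory:

values = df.id.values  # N = 265 180 entries
# Nx1 - 1xN broadcasting materialises a dense NxN array:
# 265 180**2 float64 entries is over 500 GB, hence the MemoryError.
diff = values[:, None] - values[None, :]
coincidence = (diff == 0)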

Currently, I am looping over 265k² entries, and while it's working its magic I'd like to ask for a more efficient solution for the future.

I'm doing this:

rows, cols, data_items = [], [], []
for j in range(265180):
    for i in range(265180):
        if arr1.data[j] == arr2.data[i]:
            rows.append(j)
            cols.append(i)
            data_items.append(1)
# coo_matrix expects the triplets as (data, (row, col))
mat = sparse.coo_matrix((data_items, (rows, cols)), shape=(265180, 265180))
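
For reference, scipy's coo_matrix constructor takes the coordinate triplets as a single (data, (row, col)) tuple; a minimal self-contained example with made-up toy values:

from scipy import sparse

rows = [0, 1, 2]
cols = [2, 0, 1]
data_items = [1, 1, 1]
# builds a 3x3 sparse matrix with ones at (0, 2), (1, 0) and (2, 1)
mat = sparse.coo_matrix((data_items, (rows, cols)), shape=(3, 3))
print(mat.toarray())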

Is there any way to convert the result of an Nx1 - 1xN numpy broadcast into an NxN sparse matrix when N is large enough that the calculated dense matrix doesn't fit into memory?

bmakan
  • Is the number of distinct values in the input vectors limited in any way? (For example, if the data type of the input vectors is 8 bit, then there are at most 256 distinct values in each array.) – Warren Weckesser Nov 24 '17 at 16:31
  • They are floats. But I suppose I could round them to 2 decimals or something similar. That would be 101 distinct values. Are you suggesting I convert the type to byte? – bmakan Nov 25 '17 at 20:51
  • No, I was just thinking about methods that might take advantage of a small number of distinct values. By the way, you might find some good ideas in this question and its answers: https://stackoverflow.com/questions/47475611/fastest-coincidence-matrix (ping @Divakar). – Warren Weckesser Nov 25 '17 at 21:14
