The main part of my issue was answered here.
However, my problem lies in the fact that I have a large amount of data: two vectors of 265k (265,000) entries each. I tried converting these arrays into sparse scipy matrices, but the subtraction doesn't work that way because scipy cannot use NumPy's broadcasting:
from scipy import sparse

array_one = sparse.csr_matrix(df.id.values)           # shape (1, N)
array_two = sparse.csr_matrix(df.id.values[:, None])  # shape (N, 1)

array_one - array_two  # throws MemoryError
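For context, the plain NumPy broadcasting version (which I presume is what the linked answer boils down to) would be something like this, but at N = 265k it tries to allocate the full N×N dense array (roughly 265,000² * 8 bytes ≈ 560 GB) and dies:

import numpy as np

# dense broadcast: materializes the entire (N, N) result at once
diff = df.id.values[:, None] - df.id.values[None, :]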
Currently, I am looping over all 265k² pairs, and while it's working its magic I'd like to ask for a more efficient solution for the future.
I'm doing this:
rows, cols, data_items = [], [], []
for j in range(265180):
    for i in range(265180):
        if arr1.data[j] == arr2.data[i]:
            rows.append(j)
            cols.append(i)
            data_items.append(1)
# coo_matrix takes (data, (row, col)), not three positional arrays
mat = sparse.coo_matrix((data_items, (rows, cols)), shape=(265180, 265180))
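One direction I've been toying with (a rough sketch only, with chunk as a made-up tuning parameter; I'm not sure it's the idiomatic way) is to do the broadcast in row blocks, so only a (chunk, N) slice is ever dense, and sparsify each block before stacking:

import numpy as np
from scipy import sparse

values = df.id.values   # the 265k vector of ids
n = len(values)
chunk = 1000            # rows per block; tune to available memory

blocks = []
for start in range(0, n, chunk):
    # dense boolean slice of shape (<=chunk, n); never the full n x n
    eq = values[start:start + chunk, None] == values[None, :]
    blocks.append(sparse.coo_matrix(eq))
mat = sparse.vstack(blocks).tocoo()

This keeps peak memory around chunk * n booleans per iteration instead of n² values, but I don't know if there is something better built in.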
Is there any way to get the result of an N×1 - 1×N NumPy broadcast as an N×N sparse matrix when N is large enough that the intermediate dense matrix doesn't fit into memory?