
I'm having trouble finding a strategy to parallelize a process that is slow to complete. I have to parse some data and calculate scores between all pairwise entries in a list. Since the list is very big (>1e6 entries), I have to store all the scores in a `scipy.sparse.csr_matrix((values, (row, col)))`, where `values` is a list of all non-zero scores and `row`/`col` are the row-wise/column-wise indices corresponding to their respective 2D array positions. (I use `scipy.sparse.csr_matrix` because a dense numpy 2D array would not fit in memory.)
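For context, a toy illustration of the `(data, (row, col))` constructor described above, with made-up scores and positions:

```python
from scipy.sparse import csr_matrix

# Three non-zero scores at positions (0, 2), (1, 3) and (2, 0)
# of a 4x4 matrix; all other entries are implicitly zero.
values = [0.5, 1.2, 0.3]
row = [0, 1, 2]
col = [2, 3, 0]
m = csr_matrix((values, (row, col)), shape=(4, 4))
print(m.nnz)  # 3 stored entries
```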

The main issue is how to parallelize this process while keeping the row and col indices consistent between splitting the input and joining the outputs at the end.

Here's my current code. I would appreciate any hint/direction on how to move on.

from scipy.sparse import csr_matrix

def heavy_calculation(arg1, arg2):
    # some calculations here
    return value

def generateCsrMatrix(allArgs):
    row = []
    col = []
    dat = []
    for ind1 in range(len(allArgs) - 1):
        for ind2 in range(ind1 + 1, len(allArgs)):
            value = heavy_calculation(allArgs[ind1], allArgs[ind2])
            if value > 0:
                row.append(ind1)
                col.append(ind2)
                dat.append(value)
    # End nested loop
    dat2 = csr_matrix((dat, (row, col)), shape=(len(allArgs), len(allArgs)))
    return dat2

My strategy would be to generate all pairwise indices as tuples in a list and distribute that list over workers; however, this approach adds significant memory overhead.

[(ind1, ind2), (ind1, ind3), ...]
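One way to sketch this without materializing the O(n²) pair list: submit one task per `ind1` (the inner `ind2` loop runs inside the worker), collect each worker's partial `(row, col, dat)` triple, and concatenate them before building the matrix once at the end. This is a hedged sketch, not the asker's code; `generateCsrMatrixParallel` and `process_row` are names invented here, and `heavy_calculation` is a stand-in scoring function. It assumes `multiprocessing` (not `threading`, per the GIL caveat in the comments) and that the worker function is defined at module top level so it can be pickled.

```python
from multiprocessing import Pool
from scipy.sparse import csr_matrix

def heavy_calculation(arg1, arg2):
    # Placeholder for the real (expensive) scoring function.
    return arg1 * arg2

def process_row(args):
    # One task per ind1: the ind2 loop runs locally in the worker,
    # so only len(allArgs) tasks exist instead of ~n^2/2 index pairs.
    ind1, allArgs = args
    row, col, dat = [], [], []
    for ind2 in range(ind1 + 1, len(allArgs)):
        value = heavy_calculation(allArgs[ind1], allArgs[ind2])
        if value > 0:
            row.append(ind1)
            col.append(ind2)
            dat.append(value)
    return row, col, dat

def generateCsrMatrixParallel(allArgs, workers=4):
    row, col, dat = [], [], []
    with Pool(workers) as pool:
        # imap_unordered: order of partial results doesn't matter,
        # since each triple already carries its global indices.
        for r, c, d in pool.imap_unordered(
                process_row,
                ((ind1, allArgs) for ind1 in range(len(allArgs) - 1)),
                chunksize=64):
            row.extend(r)
            col.extend(c)
            dat.extend(d)
    return csr_matrix((dat, (row, col)), shape=(len(allArgs), len(allArgs)))
```

Note that this sketch pickles `allArgs` into every task; for a >1e6-entry list you would instead share it once per worker, e.g. via a `Pool` `initializer`, or pass only the data each task needs.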
Kleenex
  • I see no attempt at parallelization here... – juanpa.arrivillaga Aug 30 '17 at 17:33
  • Sorry i forgot to add it. Question edited – Kleenex Aug 30 '17 at 17:37
  • Again, I don't see an attempt really, much more like a vague description. When you say "multithread with it", do you mean you are using the `threading` library? On CPython, you cannot use multiple threads because of the [global interpreter lock](https://stackoverflow.com/questions/1294382/what-is-a-global-interpreter-lock-gil), aka "the GIL". You have to use `multiprocessing`, unless you are working with some IO task, where the CPython interpreter releases the GIL. – juanpa.arrivillaga Aug 30 '17 at 18:11

0 Answers