
I have a simple Jaccard-like similarity calculation on a list of n-gram sets, shown below. The code runs fine on relatively small lists, but memory usage becomes a concern once the list grows to 10k entries or more. Instead of appending results to a list, it seems this memory issue could be eliminated if I could populate a preallocated NumPy array of zeros from within the similarity calculation function.

What would be the best way to get the results into an array? Thanks.

from multiprocessing import Pool

# Assumed to be defined globally before the pool is created:
#   setng     - the list of n-gram sets, one per line
#   lg        - precomputed set lengths, lg[i] == len(setng[i])
#   len_lines - len(lines)
def myfunc3(idx):
    mylist = []
    for jj in range(idx + 1, len_lines):
        aa = setng[idx].intersection(setng[jj])
        bb = len(aa) / max(lg[idx], lg[jj])  # normalized overlap in [0, 1]
        mylist.append(bb)
    return mylist

if __name__ == '__main__':
    pool = Pool(processes=4)
    myres = pool.map(myfunc3, range(len(lines) - 1))
    pool.close()
Gökhan Sever
  • why do you think a numpy array will take less memory than a list? – maxymoo May 03 '16 at 22:43
  • @maxymoo Typically because Python numbers are objects while Numpy values are stored as C-style arrays (i.e. contiguous memory). For example, the size of an int in python is 24 bytes, while an int in Numpy/C might only be 8 bytes tops. – Snakes and Coffee May 03 '16 at 23:00
  • myres is a list of lists. This becomes expensive if there are too many combinations to compute. getsizeof from sys might help to check sizes (see the sketch after these comments). If scaled properly, I think even a numpy.int8, which is 1 byte, would do the job for me. Even with this scaling my final array will be ~2 GB, and the full symmetric matrix would be double this size. – Gökhan Sever May 03 '16 at 23:11
  • does this question help you? http://stackoverflow.com/questions/9964809/numpy-vs-multiprocessing-and-mmap?rq=1 – maxymoo May 03 '16 at 23:40
  • @maxymoo do I really need a memmapped array? I am guessing the solution requires a globally shared array, yet unsure how to do this. – Gökhan Sever May 04 '16 at 13:30
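
The size argument in the comments above is easy to check empirically. Here is a minimal sketch (the 10,000-line count is an arbitrary example, not from the question) comparing a Python list with NumPy arrays for the m * (m - 1) // 2 condensed pairwise results:

import sys
import numpy as np

m = 10000
n = m * (m - 1) // 2  # number of unique pairs, ~50 million

# A Python list stores an 8-byte pointer per element, plus the float
# objects themselves (~24 bytes each when not shared); a NumPy array
# stores the raw values contiguously.
pylist = [0.0] * n
arr8 = np.zeros(n, dtype=np.int8)
arr32 = np.zeros(n, dtype=np.float32)

print(sys.getsizeof(pylist))  # ~400 MB for the pointer array alone
print(arr8.nbytes)            # ~50 MB: 1 byte per entry
print(arr32.nbytes)           # ~200 MB: 4 bytes per entry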

1 Answer


Here is a self-solution, which is not the most efficient, but it gets the job done. I used ideas from a few related SO questions. The keys here are imap, which lowers memory consumption because result lists are consumed one at a time instead of all being held at once, and running the code on Linux, where fork lets the workers share the read-only input data.

import numpy as np
from multiprocessing import Pool

if __name__ == '__main__':
    m = len(lines)
    # Condensed upper triangle: one slot per unique pair.
    # NOTE: myfunc3 returns floats in [0, 1]; scale them to integers
    # (e.g. round(bb * 100)) before storing in an int8 array, or the
    # cast below truncates every value to 0.
    dma = np.zeros(m * (m - 1) // 2, dtype=np.int8)
    pool = Pool(processes=4)
    idx = 0
    # imap yields one row's results at a time, so only the current
    # chunk sits in memory before being copied into dma.
    for resultset in pool.imap(myfunc3, range(m - 1)):
        lenres = len(resultset)
        dma[idx:idx + lenres] = resultset
        idx += lenres
    pool.close()

This populates the dma array as originally intended. I am sure this could be further improved by writing the results into the array directly from the worker function, rather than collecting them in a list first; see the sketch below.
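
As a rough illustration of that improvement, here is a minimal sketch of a shared-array variant. It assumes setng, lg, and lines are defined globally as in the question (inherited by the workers via fork on Linux); the names init_worker and myfunc3_shared and the float32 dtype are my own choices, not part of the original code:

import numpy as np
from multiprocessing import Pool
from multiprocessing.sharedctypes import RawArray

def init_worker(raw_buf, num_lines):
    # Each worker wraps the shared buffer in a writable NumPy view once.
    global shared, m
    shared = np.frombuffer(raw_buf, dtype=np.float32)
    m = num_lines

def myfunc3_shared(idx):
    # Start of row idx in the condensed upper-triangle layout.
    off = idx * (m - 1) - idx * (idx - 1) // 2
    for jj in range(idx + 1, m):
        aa = setng[idx].intersection(setng[jj])
        shared[off + jj - idx - 1] = len(aa) / max(lg[idx], lg[jj])

if __name__ == '__main__':
    nlines = len(lines)
    # Unsynchronized shared buffer: safe here because every worker
    # writes to a disjoint slice of the condensed array.
    raw = RawArray('f', nlines * (nlines - 1) // 2)
    pool = Pool(processes=4, initializer=init_worker, initargs=(raw, nlines))
    pool.map(myfunc3_shared, range(nlines - 1))
    pool.close()
    pool.join()
    dma = np.frombuffer(raw, dtype=np.float32)

Nothing but the task indices travels back through the pool, so peak memory stays close to the size of the shared buffer itself.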

Gökhan Sever