I have a simple Jaccard-like similarity calculation over a list of n-gram sets, shown below. The code runs fine on relatively small lists, but memory usage becomes a concern once the list grows to 10k entries or more. Instead of appending results to a list, I suspect the memory issue could be eliminated if I could populate a preallocated (zero-filled) NumPy array directly inside the similarity function.
What would be the best way to get the results into an array? A rough sketch of what I have in mind follows the code. Thanks.
from multiprocessing import Pool

# setng: list of n-gram sets, lg: list of their lengths,
# len_lines = len(lines); all are globals visible to the workers

def myfunc3(idx):
    # compare set idx against every later set (upper triangle only)
    mylist = []
    for jj in range(idx + 1, len_lines):
        aa = setng[idx].intersection(setng[jj])
        bb = len(aa) / max(lg[idx], lg[jj])  # overlap relative to the larger set
        mylist.append(bb)
    return mylist

if __name__ == '__main__':
    pool = Pool(processes=4)
    myres = pool.map(myfunc3, range(len(lines) - 1))
    pool.close()
    pool.join()
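For illustration, something along these lines is what I'm hoping for: a rough sketch (Python 3.8+, using multiprocessing.shared_memory) where each worker writes its row straight into a preallocated array instead of returning a list. The stand-in random data and the helper names init_worker and fill_row are made up for the example, not part of my actual code:

from multiprocessing import Pool, shared_memory
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so spawned workers rebuild identical data
setng = [set(rng.integers(0, 50, 8).tolist()) for _ in range(100)]  # stand-in n-gram sets
lg = [len(s) for s in setng]
n = len(setng)

def init_worker(shm_name, shape):
    # reattach the shared block in each worker and view it as a 2-D array
    global result, _shm
    _shm = shared_memory.SharedMemory(name=shm_name)
    result = np.ndarray(shape, dtype=np.float64, buffer=_shm.buf)

def fill_row(idx):
    # write similarities for row idx directly into the shared array
    for jj in range(idx + 1, n):
        inter = len(setng[idx].intersection(setng[jj]))
        result[idx, jj] = inter / max(lg[idx], lg[jj])

if __name__ == '__main__':
    shm = shared_memory.SharedMemory(create=True, size=n * n * 8)  # n*n float64
    result = np.ndarray((n, n), dtype=np.float64, buffer=shm.buf)
    result[:] = 0.0  # the preallocated, zero-filled array
    with Pool(processes=4, initializer=init_worker,
              initargs=(shm.name, (n, n))) as pool:
        pool.map(fill_row, range(n - 1))
    print(result[:3, :6])
    shm.close()
    shm.unlink()

The appeal here is that nothing gets pickled back to the parent process; each row lands in the shared buffer as it is computed. Is this the right approach, or is there a better way?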