I am using the following code snippet to parallelize the creation of embeddings for a neural network:
#...
from multiprocessing import Pool, cpu_count
import pickle

# Limit the pool to half of the available CPU cores, with a minimum of 1
num_processes = max(1, cpu_count() // 2)

def process_mol(index):
    # Get the molecule from the original list using its index
    obmol = obmol_list[index]
    result = mol2vec(obmol)
    return result

if __name__ == '__main__':
    print("computing embeddings...")
    # Create a pool of worker processes
    pool = Pool(processes=num_processes)
    # Create a list of indices corresponding to the positions of the molecules in obmol_list
    indices = range(len(obmol_list))
    # Parallelize the mol2vec call across the OBMols in 'obmol_list'
    data_list = pool.map(process_mol, indices)
    # Close the pool to free up resources
    pool.close()
    pool.join()
    # Pickle the data
    with open('pickled_data/data_list.pkl', 'wb') as f:
        pickle.dump(data_list, f)
#...
However, I am getting the following error:
multiprocessing.pool.MaybeEncodingError: Error sending result: '[Data(x=[16, 396],... Reason: 'OSError(24, 'Too many open files')'
I am already limiting the pool to half of the available CPU cores, but my data set (1.6 million elements) is rather large. Any ideas on how to solve this problem?
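In case it helps narrow things down: one workaround I am considering (a minimal, untested sketch; the chunksize of 256 and the batch size of 10000 are arbitrary values picked for illustration) is to stream results back with imap and pickle them in batches, instead of holding all 1.6 million results in memory and collecting them through pool.map at once:

if __name__ == '__main__':
    with Pool(processes=num_processes) as pool:
        with open('pickled_data/data_list.pkl', 'wb') as f:
            batch = []
            # imap yields results lazily as workers finish, rather than
            # materializing the full result list like pool.map does
            for result in pool.imap(process_mol, range(len(obmol_list)), chunksize=256):
                batch.append(result)
                if len(batch) >= 10000:
                    # Each dump appends one batch to the file; reading it back
                    # requires repeated pickle.load calls until EOF
                    pickle.dump(batch, f)
                    batch = []
            if batch:
                pickle.dump(batch, f)

I do not know whether this actually avoids the OSError, though, since I am not sure whether the file descriptors are being exhausted by the result objects themselves or simply by the volume of data flowing back through the pool.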