I'm processing a huge file using a multiprocessing.Pool of processes that all write to one output file. I divide the input file into partitions (essentially 2-tuples of line indices that I later pass to linecache.getline()) and use pool.map(func, list_of_partitions). Inside func, the current process works on its given partition (which is guaranteed not to overlap with any other partition), then acquires a lock before writing its results to the single output file and releases it after the write is done. The lock is created in an initializer so that it's inherited (an approach taken from this answer) and therefore shared across processes. Here's the relevant code:
import multiprocessing

l = multiprocessing.Lock()  # shared lock guarding writes to the output file
o = open("filename", 'w')   # single output file shared by all workers
pool = multiprocessing.Pool(num_cores, initializer=init, initargs=(l, o))
where init is defined as:
def init(l, o):
    # store the inherited lock and file handle as globals in each worker
    global lock, output
    lock = l
    output = o
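And func itself follows this general shape (process_partition here is just a stand-in for the actual per-partition work, which I don't think is relevant to the problem):

def func(partition):
    # partition is a 2-tuple of line indices into the input file
    result = process_partition(partition)
    with lock:  # serialize writes to the shared output file
        output.write(result)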
My problem is that the output file is missing text in random locations. At first I found that the output files were cut off at the end, but after I added many empty lines at the end of the file, I found another block of text in the middle of the file that was also missing parts, which confirmed that the truncation isn't exclusive to the end. Here is an example of an expected, complete block of text:
Pairs: [(81266, 146942, 5)]
Number of pairs: 1
idx1 range: [81266, 81266]
idx2 range: [146942, 146942]
Most similar pair: (81266, 146942, 5)
Total similarity (mass): 5
And here's a block that is cut off:
Pairs: [(81267, 200604, 5)]
Number of pairs: 1
idx1 range: [81267, 81267]
idx2 range: [200604, 200604]
Most similar pair: (81267, 200604, 5)
Total similarity (ma
And here's a more severe case, where the block is cut off after just two lines:
Pairs: [(359543, 366898, 5), (359544, 366898, 5), (359544, 366899, 6)]
Number of pairs: 3
For what it's worth, I'm calling pool.close() followed by pool.join(), although the problem persisted when I removed them.
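In other words, the driver ends with:

pool.map(func, list_of_partitions)
pool.close()   # no more tasks will be submitted to the pool
pool.join()    # wait for all worker processes to exit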
Can you think of anything that would cause this? The problem doesn't occur when I run the code normally with no parallelism. Given the same input file, I compared the number of valid, full output text blocks (like the example above) in the file produced by the parallel version against the file produced by the non-parallel version: the parallel version had 137,073 valid blocks while the non-parallel one had 137,114, meaning 41 different blocks were cut off. That's a very small number compared to the total number of blocks, which is what really baffles me. Any ideas or suggestions are greatly appreciated!
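In case it's relevant, the block counting was done along these lines (a block counts as complete if its final "Total similarity (mass)" line is intact; the file names below are placeholders):

import re

def count_valid_blocks(path):
    # a block is complete iff its closing line survived intact
    pattern = re.compile(r"Total similarity \(mass\): \d+$")
    with open(path) as f:
        return sum(1 for line in f if pattern.match(line.strip()))

print(count_valid_blocks("parallel_output.txt"))      # 137,073
print(count_valid_blocks("non_parallel_output.txt"))  # 137,114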