A few days back I answered a question on SO regarding reading a tar file in parallel.
This was the gist of the question:
import bz2
import tarfile
from multiprocessing import Pool

tr = tarfile.open('data.tar')

def clean_file(tar_file_entry):
    if '.bz2' not in str(tar_file_entry):
        return
    with tr.extractfile(tar_file_entry) as bz2_file:
        with bz2.open(bz2_file, "rt") as bzinput:
            # Reading bz2 file
            ....
            ....

def process_serial():
    members = tr.getmembers()
    processed_files = []
    for i, member in enumerate(members):
        processed_files.append(clean_file(member))
        print(f'done {i}/{len(members)}')

def process_parallel():
    members = tr.getmembers()
    with Pool() as pool:
        processed_files = pool.map(clean_file, members)
        print(processed_files)

def main():
    process_serial()    # No error
    process_parallel()  # Error

if __name__ == '__main__':
    main()
We were able to make the error disappear by just opening the tar file inside the child process rather than in the parent, as mentioned in the answer.
I am not able to understand why this worked.
Even if we open the tarfile in the parent process, each child process will get its own copy. So why does explicitly opening the tarfile in the child process make any difference?
Does this mean that in the first case, the child processes were somehow mutating the common tarfile object and causing memory corruption due to concurrent writes?
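For reference, the fix from the answer boils down to each worker opening its own handle on the archive instead of inheriting one from the parent. Here is a minimal sketch of that approach; it is illustrative, not the exact code from the answer: it passes member names rather than TarInfo objects, reopens data.tar on every call, and stands in for the real cleaning logic with a plain read.

import bz2
import tarfile
from multiprocessing import Pool

ARCHIVE = 'data.tar'  # same placeholder path as in the question

def clean_file(member_name):
    # Each worker opens its own tarfile handle instead of using one
    # inherited from the parent process.
    with tarfile.open(ARCHIVE) as tar:
        member = tar.getmember(member_name)
        if '.bz2' not in member.name:
            return None
        with tar.extractfile(member) as bz2_file:
            with bz2.open(bz2_file, 'rt') as bzinput:
                return bzinput.read()  # stand-in for the real cleaning logic

def process_parallel():
    # Only plain strings cross the process boundary, never an open handle.
    with tarfile.open(ARCHIVE) as tar:
        member_names = tar.getnames()
    with Pool() as pool:
        return pool.map(clean_file, member_names)

if __name__ == '__main__':
    print(process_parallel())

Reopening the archive for every member is wasteful; a common refinement is a Pool initializer that opens one handle per worker. The essential point is the same either way: the open file handle never crosses the process boundary.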