I'm writing a multithreaded decompressor in Python. Each thread needs to access a different chunk of the input file.
Note 1: it's not possible to load the whole file into memory, as it ranges from 15 GB to 200 GB. I'm not using multithreading to speed up data reads but data decompression; I just want to make sure that reading does not slow decompression down.
Note 2: the GIL is not a problem here, as the main decompressor function is a C extension that wraps the decompression loop in Py_BEGIN_ALLOW_THREADS/Py_END_ALLOW_THREADS, so the GIL is released while decompressing. The second-stage decompression uses NumPy, which also releases the GIL during its heavy operations.
1) I assumed it would NOT work to simply share a Decompressor object (which basically wraps a file object), since if thread A calls the following:
decompressor.seek(x)
decompressor.read(1024)
and thread B does the same, thread A might end up reading from thread B's offset. Is this correct?
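To illustrate the interleaving I'm worried about, here is a single-threaded sketch that forces the preemption order by hand, using io.BytesIO as a stand-in for the file object wrapped by the (hypothetical) shared Decompressor:

```python
import io

# Stand-in for the single file object shared by both threads.
f = io.BytesIO(bytes(range(256)))

f.seek(100)        # thread A seeks to the start of its chunk...
f.seek(200)        # ...but thread B preempts it and seeks elsewhere
data = f.read(4)   # thread A's read now starts at B's offset

print(data == bytes([200, 201, 202, 203]))  # → True: A got B's bytes
```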
2) Right now I'm simply making every thread create its own Decompressor instance and it seems to work, but I'm not sure it is the best approach. I considered these possibilities:
Add something like
seekandread(from_where, length)
to the Decompressor class, which acquires a lock, seeks, reads, and releases the lock;
Or create a dedicated reader thread that waits for read requests and executes them in the correct order.
So, am I missing an obvious solution? Is there a significant performance difference between these methods?
Thanks