
I'm writing a multithreaded decompressor in Python. Each thread needs to access a different chunk of the input file.

Note 1: it's not possible to load the whole file, as it ranges from 15 GB to 200 GB. I'm not using multithreading to speed up reading the data, but to speed up decompression; I just want to make sure that reading does not slow decompression down.

Note 2: the GIL is not a problem here, as the main decompressor function is a C extension that calls Py_BEGIN_ALLOW_THREADS, so the GIL is released while decompressing. The second-stage decompression uses NumPy, which also releases the GIL during its heavy operations.

1) I assumed it would NOT work to simply share a Decompressor object (which basically wraps a file object), since if thread A calls the following:

decompressor.seek(x)
decompressor.read(1024)

and thread B does the same, thread A might end up reading from thread B's offset. Is this correct?
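The interleaving I'm worried about can be written out explicitly (with an in-memory file standing in for the real one) to show the failure deterministically:

```python
import io

# Hypothetical illustration: two "threads" share one file-like object.
# The problematic interleaving is spelled out by hand instead of relying
# on the scheduler, so the wrong result is reproducible.
shared = io.BytesIO(b"A" * 1024 + b"B" * 1024)

# Thread A intends to read the first 1024 bytes (all b"A")...
shared.seek(0)
# ...but thread B is scheduled in between and moves the shared offset:
shared.seek(1024)
data_for_a = shared.read(1024)

# Thread A now gets thread B's chunk instead of its own.
print(data_for_a[:4])  # b'BBBB', not b'AAAA'
```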

2) Right now I'm simply making every thread create its own Decompressor instance and it seems to work, but I'm not sure it is the best approach. I considered these possibilities:

  • Add something like

    seekandread(from_where, length)

    to the Decompressor class, which acquires a lock, seeks, reads, and releases the lock;

  • Create a thread which waits for read requests and executes them in the correct order.

So, am I missing an obvious solution? Is there a significant performance difference between these methods?
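For concreteness, the first option would look roughly like this (a minimal sketch; the real decompression logic is omitted and the demo just reads a throwaway file):

```python
import os
import tempfile
import threading

class Decompressor:
    """Sketch of option 1 (names illustrative): the seek+read pair is
    made atomic by holding a lock, so one shared instance can safely
    serve several threads."""

    def __init__(self, path):
        self._f = open(path, "rb")
        self._lock = threading.Lock()

    def seekandread(self, from_where, length):
        with self._lock:  # no other thread can interleave seek and read
            self._f.seek(from_where)
            return self._f.read(length)

# Demo: two threads read disjoint ranges through the same instance.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"A" * 1024 + b"B" * 1024)
tmp.close()

dec = Decompressor(tmp.name)
results = {}

def fetch(offset, key):
    results[key] = dec.seekandread(offset, 1024)

threads = [
    threading.Thread(target=fetch, args=(0, "a")),
    threading.Thread(target=fetch, args=(1024, "b")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.unlink(tmp.name)
print(results["a"][:4], results["b"][:4])  # b'AAAA' b'BBBB'
```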

Thanks

Alberto
    Reading a file in a multithreaded fashion actually slows down the process when you have a hard drive. The needle has to jump from one place to another instead of working in an iterative fashion. You should load the file before processing it. – the_drow Feb 07 '12 at 12:16
  • It's not possible to load the whole file, as it ranges from 15 Gb to 200 Gb; I'm not using multithreading to speed up data read, but data decompression, I just wanted to make sure data read does not slow down decompression. – Alberto Feb 07 '12 at 12:21
  • Of course this may or may not apply to SSDs. I have no clue about the subject though. You shouldn't rely on hardware to do it. Once SSDs are common enough, performing I/O in a multithreaded fashion might be efficient. – the_drow Feb 07 '12 at 12:21
  • Could you please clarify that in your question? – the_drow Feb 07 '12 at 12:22
  • Ok, I have added a note – Alberto Feb 07 '12 at 12:30

3 Answers


You might want to use the Leader/Follower pattern if you are not doing so already.
The leader thread knows which segments are already being handled and which are not; it assigns itself the next unprocessed segment and then becomes a follower, leaving the leadership to the next available thread in the pool.
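A minimal sketch of that bookkeeping (class and method names are mine, not an established API): whichever thread holds the lock is the momentary leader; it claims the next unprocessed segment, and releasing the lock hands leadership to the next waiting thread:

```python
import threading

class SegmentDispenser:
    """Leader/follower sketch: segment bookkeeping is just a counter
    protected by a lock. Holding the lock = being the leader."""

    def __init__(self, n_segments):
        self._lock = threading.Lock()
        self._next = 0
        self._n = n_segments

    def claim(self):
        with self._lock:            # become the leader
            if self._next >= self._n:
                return None         # all segments handled
            seg = self._next
            self._next += 1
            return seg              # leadership passes on lock release

dispenser = SegmentDispenser(10)
claimed = []

def worker():
    while (seg := dispenser.claim()) is not None:
        claimed.append(seg)  # stand-in for "read + decompress segment"

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(claimed))  # each segment claimed exactly once
```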

the_drow

You could use mmap. See mmap() vs. reading blocks

As Tim Cooper notes, mmap is a good idea when you have random access (multiple threads would make it seem like you have this), and they would be able to share the same physical pages.
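A minimal sketch of the idea (file name and chunk layout invented for the demo): all threads share one mmap object, and each copies out its own slice. Slicing never touches a shared file offset, so there is no seek/read pair to protect:

```python
import mmap
import os
import tempfile
import threading

# Create a throwaway 2-chunk file for the demo.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"A" * 1024 + b"B" * 1024)
tmp.close()

chunks = [None, None]

def read_chunk(mm, i):
    # Slice access copies the bytes out; no shared offset is involved.
    chunks[i] = mm[i * 1024:(i + 1) * 1024]

with open(tmp.name, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    threads = [threading.Thread(target=read_chunk, args=(mm, i))
               for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    mm.close()

os.unlink(tmp.name)
print(chunks[0][:4], chunks[1][:4])  # b'AAAA' b'BBBB'
```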

Janus Troelsen
  • This seems excellent! I looked at the python documentation for mmap, but I could not find a reference about thread-safety. If 2 threads do something like a=mappedfile[x:y] at the same time, is it going to work as expected? – Alberto Feb 07 '12 at 13:47
  • To answer myself, it appears that python mmap slice notation is actually threadsafe. I created a test program that accesses different parts of a mmapped file from different threads, and checks the result. It passes the test if I use slice notation, it fails if I use seek/read. I still have to check the performance, though. – Alberto Feb 07 '12 at 16:30
  • @Alberto: It sounds to me that any given segment that is already being processed should be protected by at least a mutex, if not a throwing conditional semaphore. By that I mean a semaphore that, instead of waiting for the pre-entry condition to be met, throws an exception; it's a hybrid between a semaphore and a guard. You might want to throw only when condition B is not met and wait when condition A is met. – the_drow Feb 07 '12 at 19:33
  • @Alberto: You need a locking mechanism that splits a segment into two when a new thread is assigned to a segment already being worked on. E.g. a thread reads from 0 - 1024 and a new thread is created and assigned to 0 - 1024 as well. The first thread has already processed the first 100 bytes, so you can split the remaining work off to the second thread. Only use this if you really need to optimize. I can leave another answer with a more detailed algorithm if you wish. – the_drow Feb 07 '12 at 19:37
  • @the_drow: I'm not sure I have understood: do you mean that locking would be necessary only if two threads were accessing the same range, and is otherwise not needed? If so, there is no problem, as the code is written in such a way that it is impossible for two threads to request the same data: each thread is assigned an unique list of ranges to process. Thanks – Alberto Feb 07 '12 at 20:50
  • @Alberto: Yes, I do mean that. But I also mean that two threads can be assigned to the same data; they will just each be assigned only a part of it. – the_drow Feb 08 '12 at 06:34
  • Ok, I implemented the two possibilities and with mmap performance is more or less the same, but with the added bonus of a single file object. I think it is the best solution. – Alberto Feb 08 '12 at 12:19

CPython has a GIL, so multiple threads do not increase performance for CPU-bound tasks that stay in pure Python.

If the problem is not IO-bound (disk provides/stores data faster than CPU decompresses it) you could use multiprocessing module: each process opens the file and decompresses a given byte range.
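A rough sketch of that approach (the decompression itself is replaced by a placeholder that just reports how many bytes it got; names and chunk layout are invented for the demo):

```python
import multiprocessing as mp
import os
import tempfile

# Each worker process opens the file itself and handles one byte range,
# so no file object is shared between processes.

def process_range(path, offset, length, queue=None):
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    n = len(data)          # stand-in for: decompress(data)
    if queue is not None:
        queue.put((offset, n))
    return n

if __name__ == "__main__":
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.write(b"x" * 4096)
    tmp.close()

    queue = mp.Queue()
    procs = [mp.Process(target=process_range,
                        args=(tmp.name, i * 1024, 1024, queue))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    os.unlink(tmp.name)

    print(sorted(queue.get() for _ in range(4)))
```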

jfs
  • The main decompressor function is a C extension and it calls Py_BEGIN_ALLOW_THREADS, so the GIL is released while decompressing. The second-stage decompression uses numpy, which is also GIL-free. I've already measured a good speedup. – Alberto Feb 07 '12 at 13:32
  • (maybe this clarification, about how you had "taken care" of the GIL, could also go into the question's main body) – jsbueno Feb 07 '12 at 15:43