
I am parsing 4 large XML files in threads, and somehow the multithreaded code is slower than the sequential code.

Here is my multithreaded code:

  def parse():
    thread_list = []

    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)

    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])

and that's the "faster" code:

  def parse():
    thread_list = []

    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)

    for t in thread_list:
        result = t.result
        for res in result:
            print res
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])

The sequential code is faster by 10 whole minutes. How is this possible?

DSM
Jack_90

3 Answers


Python uses the GIL (Global Interpreter Lock) to ensure that only one thread executes Python bytecode at a time. It exists to protect the interpreter's memory management from data races, among other reasons. As a consequence, multithreading in the default CPython implementation rarely speeds up CPU-bound work such as parsing, and can even slow it down, as it did in your case.
To parallelize your workload efficiently, look into Python's multiprocessing module, which launches separate processes that are not affected by each other's GIL.

Here's a SO question on that topic

illright
  • +1 :D I'll just add that regular Python threads are meant to parallelize I/O operations and are very good at it. Some other system calls can also be parallelized efficiently. Otherwise, you use them for timers and for loops that don't do a lot of work. When you use them for something like parsing, searching or sorting, the GIL is acquired and released so many times that thrashing occurs: the threads fight each other over the GIL instead of doing their jobs. Some libraries can release the GIL and perform heavy work (e.g. numpy); using them in a thread works just fine. – Dalen Nov 20 '16 at 15:43
  • While multiprocessing is a solution, it has slowdowns at other points of execution. Creating new processes is somewhat slower than creating new threads, and passing results back to the main process uses pickling and I/O, which also introduces latency. If you have something very big to parse, write it in C or Cython (Cython can release the GIL using a with statement). Or use a Python distribution that doesn't use the CPython virtual machine, such as Jython or IronPython; their VMs do not use a GIL for safe memory management. – Dalen Nov 20 '16 at 16:12
  • Also, you can try searching for a library that would allow you to call a threaded Python function with the GIL released. I am not sure that there is any (yet); perhaps you will write us one. – Dalen Nov 20 '16 at 16:13
0

Where did you read that multithreading or even multiprocessing should always be faster than sequential execution? That is simply wrong. Which of the three modes is faster depends highly on the problem to solve and on where the bottleneck is.

  • if the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet
  • if the bottleneck is I/O, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for I/O completion during that time and you will get much better throughput (although the truly fastest way is normally to poll I/O with select when possible)
  • only if the bottleneck is CPU processing, which IMHO is not the most common use case, is parallelization over different cores the winner. In Python that means multiprocessing (*). This mainly concerns heavy computation
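As a small illustration of the I/O-bound bullet above (a sketch, with `time.sleep` standing in for a blocking I/O call, which releases the GIL the same way a blocking read does):

```python
# Sketch of the I/O-bound case: time.sleep releases the GIL like a
# blocking I/O call would, so the four waits overlap across threads.
import threading
import time

def fake_io(results, i):
    time.sleep(0.2)          # stands in for a blocking I/O operation
    results[i] = i * i

results = [None] * 4
start = time.time()
threads = [threading.Thread(target=fake_io, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(results)  # four results in roughly 0.2 s rather than 0.8 s
```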

In your use case, there is one other potential cause: you wait for the threads in sequence in the join loop. That means that if thread 2 finishes well before thread 0, you will only process its results after thread 0 has ended, which is suboptimal.

This kind of code is often more efficient because it allows processing as soon as one thread has finished:

import time

active_list = thread_list[:]
while active_list:
    # iterate over a copy so items can be removed safely during the loop
    for t in active_list[:]:
        if not t.is_alive():
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)
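An alternative sketch using the standard-library concurrent.futures module, which hands back results in completion order and so avoids the polling loop entirely; `parse_one` here is a hypothetical stand-in for the per-file work, not the asker's CVEParser:

```python
# Sketch: as_completed yields each future as soon as its thread
# finishes, regardless of submission order. parse_one is a placeholder
# for the real per-file parsing.
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_one(file_name):
    return file_name.upper()  # placeholder for real parsing

files = ['a.xml', 'b.xml', 'c.xml', 'd.xml']
results = []
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(parse_one, f) for f in files]
    for fut in as_completed(futures):   # earliest finisher first
        results.append(fut.result())
print(sorted(results))
```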

(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example is numpy: complex numpy operations executed in multiple threads can actually run simultaneously on different cores. Technically, this means releasing the Global Interpreter Lock.

Serge Ballesta
0

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.

The disk can only really read one file at a time, and it will have to physically move the read/write head back and forth between them many times to service the different reading threads. That seeking takes a lot longer than actually reading the data, and you will have to wait for it.

If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

Matt Timmermans