
Suppose I have a table with 100,000 rows and a Python script which performs some operations on each row of this table sequentially. Now, to speed up this process, should I create 10 separate scripts and run them simultaneously, each processing a consecutive block of 10,000 rows of the table, or should I create 10 threads to process the rows, for better execution speed?

user7845429
  • Have a look at https://docs.python.org/3/library/concurrent.futures.html. It sounds like you want `map` from either a ProcessPoolExecutor or ThreadPoolExecutor. Which one you want depends on the nature of the operation you're mapping (CPU bound vs IO bound respectively). – Chris Feb 20 '19 at 14:03
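A minimal sketch of what that comment suggests (`process_row` is a hypothetical stand-in for the question's actual per-row operation):

```python
from concurrent.futures import ProcessPoolExecutor  # or ThreadPoolExecutor for IO-bound work

def process_row(row):
    return row  # placeholder for the real per-row operation

if __name__ == "__main__":
    rows = range(100_000)
    # CPU bound -> ProcessPoolExecutor; IO bound -> ThreadPoolExecutor
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_row, rows, chunksize=1_000))
```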

2 Answers


Threading

  • Due to the Global Interpreter Lock (GIL), Python threads are not truly parallel. In other words, only a single thread can be executing Python bytecode at any one time.
  • If you are performing CPU-bound tasks, then dividing the workload amongst threads will not speed up your computations. If anything, it will slow them down, because there are more threads for the interpreter to switch between.
  • Threading is much more useful for IO-bound tasks, for example communicating with a number of different clients/servers at the same time. In this case you can switch between threads while you are waiting for different clients/servers to respond (see the sketch after this list).
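A minimal sketch of the IO-bound case, assuming the per-row work involves fetching something over the network (the URLs below are placeholders); the threads overlap their network waits:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()  # the GIL is released while waiting on IO

urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=10) as pool:
    pages = list(pool.map(fetch, urls))
```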

Multiprocessing

  • As Eman Hamed has pointed out, it can be difficult to share objects while multiprocessing, since each process has its own address space (see the sketch below).
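To illustrate, a hedged sketch of the chunked approach from the question: each worker process receives a pickled *copy* of its chunk, so it cannot mutate a shared table in place and must return its results (the per-row work here is a placeholder):

```python
from concurrent.futures import ProcessPoolExecutor

def process_chunk(rows):
    # operates on a copy; changes here are invisible to the parent process
    return [row * 2 for row in rows]  # placeholder per-row work

if __name__ == "__main__":
    rows = list(range(100_000))
    chunks = [rows[i:i + 10_000] for i in range(0, len(rows), 10_000)]
    with ProcessPoolExecutor(max_workers=10) as pool:
        results = [r for chunk in pool.map(process_chunk, chunks) for r in chunk]
```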

Vectorization

  • Libraries like pandas allow you to use vectorized methods on tables. These are highly optimized operations written in C that execute very fast on an entire table or column. Depending on the structure of your table and the operations that you want to perform, you should consider taking advantage of this (see the example after this list).
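For instance, a minimal sketch of the difference, using a made-up DataFrame and column names:

```python
import pandas as pd

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

# Slow: a Python-level loop over 100,000 rows
# df["c"] = [row.a * row.b for row in df.itertuples()]

# Fast: one vectorized operation over entire columns, executed in C
df["c"] = df["a"] * df["b"]
```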
Arran Duff
  • FWIW: Python threads do run [concurrently](https://en.wikipedia.org/wiki/Concurrent_computing). That is to say, more than one thread can be "in progress" at the same time. Due to the GIL, they can't achieve true [parallelism](https://en.wikipedia.org/wiki/Parallel_computing). I.e., two or more different CPUs _simultaneously_ executing instructions on behalf of two or more different threads. – Solomon Slow Feb 20 '19 at 14:13
  • @SolomonSlow I'm not sure I understand what you mean when you say that more than one thread can be **in progress** at the same time? The GIL is a lock that must be acquired for a thread to run. Since only one thread can acquire the lock at any one time, it is my understanding that only one thread can run at any one time... – Arran Duff Feb 20 '19 at 18:42
  • By, "in progress," I meant that the thread has been started, but it has not yet finished. It may or it may not be running, but either way it has a _context_, which consists (at least) of a stack of unfinished function calls (including all of their arguments and local variables), and an instruction pointer that tells what statement the thread will execute next. The GIL allows only one Python thread to run at any given instant in time, but any number of threads can be started-but-not-yet-finished, and that's what computer scientists usually define "concurrently" to mean. – Solomon Slow Feb 20 '19 at 19:34
  • It's like how you can tell your friends or neighbors that, "I'm remodeling my kitchen," when in fact, at that exact moment, you actually are working out at the gym. The remodel can still be "in progress" even though you actually are _doing_ something else. – Solomon Slow Feb 20 '19 at 19:39
  • @SolomonSlow thanks for the explanation. I hadn't realised the difference between concurrent and parallel. Good to know. I've corrected my answer – Arran Duff Feb 20 '19 at 21:36

Threads of the same process share a continuous (virtual) memory block known as the heap; separate processes don't. Threads also consume fewer OS resources than whole processes (separate scripts), and switching between threads is much cheaper than switching between processes.

The single biggest performance factor in multithreaded execution, when there is no locking/barriers involved, is data-access locality, e.g. in matrix-multiplication kernels.

Suppose the data is stored in the heap in a linear fashion, i.e. the 0th row in bytes [0–4095], the 1st row in bytes [4096–8191], etc. Then thread 0 should operate on rows 0, 10, 20, ..., thread 1 on rows 1, 11, 21, ..., etc.

The main idea is to have a set of 4 KiB pages kept in physical RAM and 64-byte blocks kept in L3 cache, and to operate on them repeatedly. Computers usually assume that if you 'use' a particular memory location then you're also going to use adjacent ones, and you should do your best to do so in your program. The worst-case scenario is accessing memory locations that are ~10 MiB apart in a random fashion, so don't do that. E.g., if a single row is 1,310,720 doubles (64 bits each, i.e. 10 MiB) in size, then your threads should operate in an intra-row (within a single row) rather than an inter-row (as above) fashion, as in the sketch below.
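A sketch of such intra-row splitting, assuming the row is a NumPy array (the per-element work is a placeholder): each thread streams through one contiguous slice, hitting adjacent cache lines instead of jumping ~10 MiB at a time:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

row = np.random.rand(1_310_720)          # one 10 MiB row of doubles
n_workers = 4
chunks = np.array_split(row, n_workers)  # contiguous views, no copies

def process(chunk):
    return chunk.sum()  # placeholder for the real per-element work

with ThreadPoolExecutor(n_workers) as pool:
    total = sum(pool.map(process, chunks))
```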

Benchmark your code, and depending on your results: if your algorithm can process around 21.3 GiB/s of rows (the theoretical bandwidth of DDR4-2666), then you have a memory-bound task. If your code processes more like 1 GiB/s, then you have a compute-bound task, meaning executing instructions on the data takes more time than fetching the data from RAM, and you need to either optimize your code, reach a higher IPC by utilizing the AVX instruction sets, or buy a newer processor with more cores or a higher frequency.
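A crude way to estimate which side you're on (the array and the operation below are placeholders for your table and per-row work):

```python
import time
import numpy as np

data = np.random.rand(50_000_000)   # ~381 MiB of doubles

start = time.perf_counter()
total = data.sum()                  # stand-in for your row processing
elapsed = time.perf_counter() - start

print(f"{data.nbytes / elapsed / 2**30:.1f} GiB/s")
```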

FranG