2

I have a Python script with a long runtime (about 4 days when a single instance was tested on a single virtual-core machine). It works on a list of files determined by the input arguments we provide.

The list of arguments I want to test is very long, and running them sequentially is not feasible because of the cost of the infrastructure.

So I tried running independent instances of the script with different arguments on my 12-core machine, e.g.

nohup python script.py 1 &
nohup python script.py 2 &

and so on, 8 more times, thinking that each process would be allocated to its own core (with 2 cores left on standby). Since there is no overlap in the files the scripts work on and each script is passed a different argument, no race condition or deadlock should occur, and the GIL should not be a problem because these are separate processes, not threads.

What I have observed is that the individual Python scripts are not running at a similar pace or following the previously noted timeline. All ten distinct processes should have finished within 4 days, but it has been about 14 days and only some of them have completed, and those only in the last 1-2 days. The rest of the processes are lagging behind, which I can see from the log files they generate.

Can someone please help me understand this behaviour in Python?

lorenzofeliz
  • "all the ten distinct processes should finish within 4 days" Are you sure about that? Did you run them seperately? Do they depend on some common resource that may limit the processing speed? –  Jun 02 '16 at 07:16
  • Yes, I know the list of files they depend on, and all of them are distinct. There is no common file/resource between them. – lorenzofeliz Jun 02 '16 at 07:27
  • 2
    The processor is the bottleneck only in a poorly written program, or when you have heavy computations (meteorological or oceanographic models). If launching multiple processes on a multicore machine does not increase speed much, you should ask whether your processing is IO- or memory-bound instead of processor-bound – Serge Ballesta Jun 02 '16 at 07:45
  • I understand that, and the code is written so that there won't be any deadlocks or overhead. The script does have IO operations but is not heavy on memory. Each CSV it reads is about 160-200 KB and has at most around 20K rows. At a given time it reads one of these files, stores the result in memory, and moves on to another file; in all, only 24-29 files are read by any one script. The results generated are also not large, about a 200 KB file at most. The computation is done on the rows previously read into memory, and that alone takes the largest chunk of time. – lorenzofeliz Jun 02 '16 at 10:26
  • How many files? What type of operations do you execute? Can you provide more details? – gogasca Jun 03 '16 at 08:53
  • I am extremely sorry, I had reported the file sizes wrong. The CSVs are read in pairs: one is 174-180 MB (about 50 MB when gzip-compressed) and the other is 1 MB (about 100 KB when gzip-compressed). In all, 24-30 CSV file pairs are read by a script, depending on the input argument. At a given time only a single file is read, and while reading it only part of the file is read into memory and processed; once that processing is done we move on to the next part. – lorenzofeliz Jun 03 '16 at 12:35

1 Answer

0

I would first suggest profiling your program to describe and understand more about its performance:

You can start with the Python line profiler (line_profiler), which shows how fast and how often each line of code in your script runs.

pip install line_profiler
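
As a minimal sketch of the typical usage (process_file is just a placeholder name, not a function from your script): decorate the function you want to measure and run the script through kernprof.

# script.py -- @profile is injected by kernprof at run time, no import needed
@profile
def process_file(path):
    rows = []
    with open(path) as f:          # shows whether time goes into I/O...
        for line in f:
            rows.append(line.split(','))
    return len(rows)               # ...or into the computation that follows

if __name__ == '__main__':
    process_file('data.csv')

Then run it and print the per-line hit counts and timings:

kernprof -l -v script.py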

Once you have timed it, you can start analyzing memory usage:

pip install -U memory_profiler
pip install psutil
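
A rough sketch of how memory_profiler is used in the same way (process_file is again only a placeholder): decorate the function and run the script under the module.

from memory_profiler import profile

@profile
def process_file(path):
    # the line-by-line report shows how much memory each statement adds
    rows = [line.split(',') for line in open(path)]
    return len(rows)

if __name__ == '__main__':
    process_file('data.csv')

python -m memory_profiler script.py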

The quickest way to find "memory leaks" is to use an awesome tool called objgraph. It allows you to see the number of objects in memory and also locate all the different places in your code that hold references to these objects.

pip install objgraph
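
A small sketch of how objgraph is typically called from inside the script to see which object types dominate memory and whether any of them keep accumulating:

import objgraph

# print the most common object types currently alive in the interpreter
objgraph.show_most_common_types(limit=10)

# ... do some processing ...

# show which object types have grown since the previous call,
# which helps spot objects that keep accumulating
objgraph.show_growth(limit=10)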

Once you understand each part of the script, can you describe more about the nature of your code? Also please read this extract from a previous post:

"In order to take advantage of a multicore (or multiprocessor) computer, you need a program written in such a way that it can be run in parallel, and a runtime that will allow the program to actually be executed in parallel on multiple cores (and operating system, although any operating system you can run on your PC will do this). This is really parallel programming, although there are different approaches to parallel programming. The ones that are relevant to Python are multiprocessing and multithreading."

Does python support multiprocessor/multicore programming?
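
If the per-argument work really is CPU-bound and independent, another option (instead of launching ten nohup jobs by hand) is to drive everything from one script with a multiprocessing pool, roughly like this (run_for_argument is a placeholder for whatever script.py currently does for one argument):

from multiprocessing import Pool

def run_for_argument(arg):
    # placeholder: the work script.py does for one input argument
    return arg

if __name__ == '__main__':
    args = list(range(1, 11))            # the ten input arguments
    with Pool(processes=10) as pool:     # one worker process per argument
        results = pool.map(run_for_argument, args)
    print(results)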

gogasca