
In a previous question I asked about multiprocessing (using multiple cores to make a program run faster), and someone told me this:

More often than not, you can get a 100x+ optimization with better code compared to a 4x improvement and additional complexities with multiprocessing

They then recommended that I should:

Use a profiler to understand what is slow, then focus on optimizing that.

So I went to this question: How can you profile a script?

There I found cProfile and tried it out in some test code to see how it works.

This is my code:

import cProfile

def foo():
    for i in range(10000):
        a = i**i
        if i % 1000 == 0:
            print(i)

cProfile.run('foo()')

However, after running it, this was what I got:

0
1000
2000
3000
4000
5000
6000
7000
8000
9000
         1018 function calls in 20.773 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   20.773   20.773 <string>:1(<module>)
      147    0.000    0.000    0.000    0.000 rpc.py:150(debug)
       21    0.000    0.000    0.050    0.002 rpc.py:213(remotecall)
       21    0.000    0.000    0.002    0.000 rpc.py:223(asynccall)
       21    0.000    0.000    0.048    0.002 rpc.py:243(asyncreturn)
       21    0.000    0.000    0.000    0.000 rpc.py:249(decoderesponse)
       21    0.000    0.000    0.048    0.002 rpc.py:287(getresponse)
       21    0.000    0.000    0.000    0.000 rpc.py:295(_proxify)
       21    0.001    0.000    0.048    0.002 rpc.py:303(_getresponse)
       21    0.000    0.000    0.000    0.000 rpc.py:325(newseq)
       21    0.000    0.000    0.002    0.000 rpc.py:329(putmessage)
       21    0.000    0.000    0.000    0.000 rpc.py:55(dumps)
       20    0.000    0.000    0.001    0.000 rpc.py:556(__getattr__)
        1    0.000    0.000    0.001    0.001 rpc.py:574(__getmethods)
       20    0.000    0.000    0.000    0.000 rpc.py:598(__init__)
       20    0.000    0.000    0.050    0.002 rpc.py:603(__call__)
       20    0.000    0.000    0.051    0.003 run.py:340(write)
        1   20.722   20.722   20.773   20.773 test.py:3(foo)
       42    0.000    0.000    0.000    0.000 threading.py:1226(current_thread)
       21    0.000    0.000    0.000    0.000 threading.py:215(__init__)
       21    0.000    0.000    0.047    0.002 threading.py:263(wait)
       21    0.000    0.000    0.000    0.000 threading.py:74(RLock)
       21    0.000    0.000    0.000    0.000 {built-in method _struct.pack}
       21    0.000    0.000    0.000    0.000 {built-in method _thread.allocate_lock}
       42    0.000    0.000    0.000    0.000 {built-in method _thread.get_ident}
        1    0.000    0.000   20.773   20.773 {built-in method builtins.exec}
       42    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
       63    0.000    0.000    0.000    0.000 {built-in method builtins.len}
       10    0.000    0.000    0.051    0.005 {built-in method builtins.print}
       21    0.000    0.000    0.000    0.000 {built-in method select.select}
       21    0.000    0.000    0.000    0.000 {method '_acquire_restore' of '_thread.RLock' objects}
       21    0.000    0.000    0.000    0.000 {method '_is_owned' of '_thread.RLock' objects}
       21    0.000    0.000    0.000    0.000 {method '_release_save' of '_thread.RLock' objects}
       21    0.000    0.000    0.000    0.000 {method 'acquire' of '_thread.RLock' objects}
       42    0.047    0.001    0.047    0.001 {method 'acquire' of '_thread.lock' objects}
       21    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       21    0.000    0.000    0.000    0.000 {method 'dump' of '_pickle.Pickler' objects}
       20    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
       21    0.000    0.000    0.000    0.000 {method 'getvalue' of '_io.BytesIO' objects}
       21    0.000    0.000    0.000    0.000 {method 'release' of '_thread.RLock' objects}
       21    0.001    0.000    0.001    0.000 {method 'send' of '_socket.socket' objects}

I was expecting it to show me which parts of my code were taking the longest, for example that the line a = i**i was the slowest to compute. Instead, all I can gather from the output is that the foo() function took the longest; it tells me nothing about what took the longest inside that function.

Also, when I profile my actual code, the same thing happens. Everything is in functions, so it only tells me which functions are taking the longest, not what inside those functions is making them slow.

So here are my main questions:

  1. How do I see what inside a function is making the code take so long? (Should I even be using cProfile?)

  2. What is the best way to set about optimizing my code once I know what is using the most CPU?

Note: My RAM, disk, etc. are absolutely fine; it is just the CPU which is maxed out (12% CPU usage, because it only runs on a single core).

MSeifert
Rlz
  • printing takes much longer than you think. Consider profiling functions which don't have I/O – Jean-François Fabre Aug 26 '17 at 09:10
  • @Jean-FrançoisFabre What do you mean by a profiling function which doesn't have I/O? And also it only prints 10 times, right? – Rlz Aug 26 '17 at 09:12
  • you're right. My bad – Jean-François Fabre Aug 26 '17 at 09:18
  • @MikeDunlavey Have a look at MSeifert and his answer. 99.6% of the time is taken up by the `a = i**i` – Rlz Aug 26 '17 at 13:50
  • @MikeDunlavey The point is that `x ** x` can be quite slow if you're dealing with large `x`. I mean `10000 ** 10000` is a number with 40001 digits and takes roughly 1ms (okay, I do have a slow computer) to calculate! That will be **much** slower than `print`s. Especially if the ratio of arithmetic to prints is 1000 : 1. – MSeifert Aug 26 '17 at 14:13

2 Answers


How do I see what inside the function is making the code take so long (Should I even use cProfile?)

Yes, you can use cProfile, but the way you asked the question makes me wonder whether line_profiler (a third-party module; you need to install it) wouldn't be a better tool.
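For completeness: even with plain cProfile you can sort the report to surface the hottest function, using the standard-library pstats module. A minimal sketch (the reduced range(1000) is just to keep the demo fast; this is still per-function granularity, which is exactly the limitation you ran into):

```python
import cProfile
import io
import pstats

def foo():
    for i in range(1000):  # smaller range than in the question, to keep the demo quick
        a = i ** i

profiler = cProfile.Profile()
profiler.enable()
foo()
profiler.disable()

# Sort by tottime (time spent inside each function, excluding sub-calls)
# and print only the top 5 entries
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
print(stream.getvalue())
```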

I'm using the IPython/Jupyter bindings of this package when I want to profile a function:

%load_ext line_profiler

To actually profile a function:

%lprun -f foo foo()
#             ^^^^^---- this call will be profiled
#         ^^^-----------function to profile

Which produces this output:

Timer unit: 5.58547e-07 s

Total time: 17.1189 s
File: <ipython-input-1-21b5a5f52f66>
Function: foo at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def foo():
     2     10001        31906      3.2      0.1      for i in range(10000):
     3     10000     30534065   3053.4     99.6          a = i**i
     4     10000        75998      7.6      0.2          if i % 1000 == 0:
     5        10         6953    695.3      0.0              print(i)

That includes several things that might be interesting. For example, 99.6% of the time is spent in the `i**i` line.

  2. What is the best way to set about optimising my code once I know what is using the most CPU?

That depends. Sometimes you need different functions/data structures/algorithms; sometimes you can't do anything. But at least you know where your bottleneck is, and you can estimate how much impact a change at the bottleneck (or somewhere else) would have.
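For instance, once you know the bottleneck line, the standard-library timeit module lets you measure candidate replacements before committing to one. A small sketch (the two expressions are just illustrative stand-ins, not taken from the code above):

```python
import timeit

# Compare two equivalent ways of squaring numbers in a hot loop;
# i * i skips the generic power machinery that i ** 2 goes through.
pow_time = timeit.timeit("[i ** 2 for i in range(1000)]", number=200)
mul_time = timeit.timeit("[i * i for i in range(1000)]", number=200)

print(f"i ** 2: {pow_time:.4f}s")
print(f"i * i : {mul_time:.4f}s")
```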

mosc9575
MSeifert
  • Amazing answer! So do I put into my code when it is calling the function %lprun -f foo foo() – Rlz Aug 26 '17 at 09:34
  • @RulerOfTheWorld Not sure what you mean. Do you mean how to profile without IPython/Jupyter? – MSeifert Aug 26 '17 at 09:35
  • `line_profiler` doesn't rely on IPython or Jupyter. You can use the `@profile` decorator on the functions you want to profile, copy/paste the `kernprof.py` and `line_profiler.py` files into the same folder as your script and run `python kernprof.py -l -v my_script.py > output_file.txt`. This will output a text file of timings. I haven't managed to successfully do it without copy/pasting the source files around, but it's trivial to copy them over. – roganjosh Aug 26 '17 at 09:42
  • @roganjosh Thanks, sorry I missed that. Is there a way to do it so it works with any script no matter where the file is located? – Rlz Aug 26 '17 at 10:02
  • @RulerOfTheWorld There is a nice [Q+A](https://stackoverflow.com/q/22328183/5393381) here that shows how to run with and without invoking `kernprof` from the command line. :) – MSeifert Aug 26 '17 at 10:02
  • @RulerOfTheWorld I'm sure there is but I have struggled with it for some time and threw the towel in. I don't know what the issue is, I guess something is wonky with the PATH. Since I profile most of my scripts anyway (so I have lots of copies of those files), I have just settled with that approach. – roganjosh Aug 26 '17 at 10:04
  • @RulerOfTheWorld I created a gist [here](https://gist.github.com/MSeifert04/bb103bd95c57ccda88fc2709fc361579) that does the equivalent to my `lprun` command without IPython. It's a bit harder to understand because of all the wrappers ... Maybe using `kernprof` from the command line is easier. :D – MSeifert Aug 26 '17 at 10:06
  • Oh, windows. Do you have conda/anaconda? – MSeifert Aug 26 '17 at 10:13
  • Don't think so. How do I know If I have it? – Rlz Aug 26 '17 at 10:14
  • @RulerOfTheWorld It's something like a [virtual environment + package manager for Python](https://www.continuum.io/downloads). It's very handy especially on windows because you don't need to compile packages containing C or C++ yourself (which is hard to setup on windows because you need a C compiler and set it up correctly). Lots of packages (including line_profiler) can be easily installed with it. – MSeifert Aug 26 '17 at 10:16
  • @RulerOfTheWorld if you want to get any work done on Windows, your options are basically Anaconda and [unofficial binaries](http://www.lfd.uci.edu/~gohlke/pythonlibs/). NumPy is super tough to get working properly without these. I did manage to get line_profiler installed via pip IIRC but I just couldn't use it the way I should have been able. This is where the copy/paste hack came from – roganjosh Aug 26 '17 at 10:19
  • Is there any way to install it without anaconda/conda? Am I doing something wrong or is it not possible? – Rlz Aug 26 '17 at 10:19
  • @RulerOfTheWorld at this point I think you're at the stage of another question. I've given you a workable solution for Windows, it involves copy/pasting 2 files. If that is not suitable then you open the door to more issues. – roganjosh Aug 26 '17 at 10:26
  • @roganjosh ok thanks. I will use your workable solution for now. If I need anything more I will ask a new question. Thanks for all your help – Rlz Aug 26 '17 at 10:28
  • @roganjosh Where do I type `python kernprof.py -l -v my_script.py > output_file.txt` – Rlz Aug 26 '17 at 10:34
  • @RulerOfTheWorld Hold down shift and right click anywhere in the folder that contains all your scripts. Click "open command window here" and then type it in there. Did you decorate your functions with `@profile`? – roganjosh Aug 26 '17 at 10:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/152892/discussion-between-ruler-of-the-world-and-roganjosh). – Rlz Aug 26 '17 at 10:36

As you have noticed in the profiling log, cProfile's maximum resolution is the function.

So:

  • if your function is small, you may be able to figure out which part is taking a long time (although it's sometimes difficult with built-in calls like `in`)
  • if your function is big, maybe it's time to cut it down into smaller functions, which then become "profileable"; be aware, however, that the function call overhead may slow things down
Jean-François Fabre