
I'm trying to find out why the following very simple, minimal example takes ~33% of CPU usage on my i7-5500U CPU, Windows 10 computer:

import time
import numpy as np
import scipy.linalg
import cProfile

class CPUTest:
    def __init__(self):
        self.running = True

    def compute_stuff(self):
        # Busy work only; the inverse B is computed and intentionally discarded.
        dims = 150
        A = np.random.random((dims, dims))
        B = scipy.linalg.inv(np.dot(A.T, A))

    def run(self):
        prev_time = time.time()
        start_time = prev_time
        while self.running:
            time.sleep(0.3)
            st = time.time()
            self.compute_stuff()
            et = time.time()
            print('Time for the whole iteration, inc. sleep: %.3f (ms), whereas the processing segment took %.3f (ms): ' % ((st - prev_time) * 1000, (et - st) * 1000))
            prev_time = st
            if st - start_time > 10.0:
                break

t = CPUTest()
t.run()
# cProfile.run('t.run()')

The compute_stuff function takes only ~2 ms; the rest of the time the program is sleeping. Since sleep should not use the CPU, this program should, in theory, run at about 0.6% CPU usage, yet it is taking around 30%.

I've tried a profiler, which confirms that the program sleeps for 9.79 of the 10 seconds.

Can someone please provide a hint on why Python behaves this way, and what alternatives there are to reduce the CPU usage?

Thanks a lot!

Edit

In summary, the program is asleep >97% of the time, and I still get a CPU usage of 33%. I would like to reduce the CPU usage without sacrificing the frequency for computations.
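One standard-library way to cross-check the Task Manager figure is to compare the process's CPU time against wall time; this is just a sketch for illustration (assuming Python 3, where time.process_time excludes sleep), not the measurement I originally used:

```python
import time

def cpu_fraction(fn, wall_budget=1.0, period=0.05):
    """Run fn in a sleep loop for ~wall_budget seconds and return the
    fraction of wall time the process actually spent on the CPU."""
    w0 = time.time()
    c0 = time.process_time()  # user + system CPU time; excludes sleep
    while time.time() - w0 < wall_budget:
        time.sleep(period)
        fn()
    return (time.process_time() - c0) / (time.time() - w0)

# A cheap stand-in for compute_stuff: should give a fraction near zero.
print('CPU fraction: %.3f' % cpu_fraction(lambda: sum(i * i for i in range(1000))))
```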

Here you can find an example of the program output:

Time for the whole iteration, inc. sleep: 302.000 (ms), whereas the processing segment took 1.000 (ms):
Time for the whole iteration, inc. sleep: 301.000 (ms), whereas the processing segment took 2.000 (ms):
Time for the whole iteration, inc. sleep: 303.000 (ms), whereas the processing segment took 3.000 (ms):
Time for the whole iteration, inc. sleep: 303.000 (ms), whereas the processing segment took 2.000 (ms):
Time for the whole iteration, inc. sleep: 302.000 (ms), whereas the processing segment took 1.000 (ms):
Time for the whole iteration, inc. sleep: 302.000 (ms), whereas the processing segment took 2.000 (ms):
Time for the whole iteration, inc. sleep: 302.000 (ms), whereas the processing segment took 2.000 (ms):
Time for the whole iteration, inc. sleep: 303.000 (ms), whereas the processing segment took 1.000 (ms):
Time for the whole iteration, inc. sleep: 301.000 (ms), whereas the processing segment took 2.000 (ms):
Time for the whole iteration, inc. sleep: 303.000 (ms), whereas the processing segment took 1.000 (ms):  

and this is the output from the profiler:

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000   10.050   10.050 <string>:1(<module>)
    1    0.019    0.019    0.021    0.021 __init__.py:133(<module>)
    1    0.067    0.067    0.119    0.119 __init__.py:205(<module>)
    1    0.000    0.000    0.000    0.000 _components.py:1(<module>)
    1    0.000    0.000    0.000    0.000 _laplacian.py:3(<module>)
   49    0.000    0.000    0.000    0.000 _methods.py:37(_any)
   49    0.000    0.000    0.001    0.000 _methods.py:40(_all)
   49    0.011    0.000    0.137    0.003 _util.py:141(_asarray_validated)
    1    0.001    0.001    0.001    0.001 _validation.py:1(<module>)
    1    0.000    0.000    0.000    0.000 _version.py:114(_compare)
    1    0.000    0.000    0.000    0.000 _version.py:148(__gt__)
    2    0.000    0.000    0.000    0.000 _version.py:55(__init__)
    1    0.000    0.000    0.000    0.000 _version.py:78(_compare_version)
    1    0.008    0.008    0.009    0.009 base.py:1(<module>)
    1    0.000    0.000    0.000    0.000 base.py:15(SparseWarning)
    1    0.000    0.000    0.000    0.000 base.py:19(SparseFormatWarning)
    1    0.000    0.000    0.000    0.000 base.py:23(SparseEfficiencyWarning)
    1    0.000    0.000    0.000    0.000 base.py:61(spmatrix)
   49    0.000    0.000    0.000    0.000 base.py:887(isspmatrix)
   49    0.043    0.001    0.185    0.004 basic.py:619(inv)
   49    0.000    0.000    0.001    0.000 blas.py:177(find_best_blas_type)
   49    0.001    0.000    0.002    0.000 blas.py:223(_get_funcs)
    1    0.000    0.000    0.000    0.000 bsr.py:1(<module>)
    1    0.000    0.000    0.000    0.000 bsr.py:22(bsr_matrix)
    1    0.012    0.012    0.012    0.012 compressed.py:1(<module>)
    1    0.000    0.000    0.000    0.000 compressed.py:21(_cs_matrix)
    1    0.000    0.000    0.000    0.000 construct.py:2(<module>)
    1    0.000    0.000    0.000    0.000 coo.py:1(<module>)
    1    0.000    0.000    0.000    0.000 coo.py:21(coo_matrix)
   49    0.000    0.000    0.000    0.000 core.py:5960(isMaskedArray)
   49    0.001    0.000    0.242    0.005 cpuTests.py:10(compute_stuff)
    1    0.013    0.013   10.050   10.050 cpuTests.py:15(run)
    1    0.000    0.000    0.000    0.000 csc.py:1(<module>)
    1    0.000    0.000    0.000    0.000 csc.py:19(csc_matrix)
    1    0.008    0.008    0.020    0.020 csr.py:1(<module>)
    1    0.000    0.000    0.000    0.000 csr.py:21(csr_matrix)
   18    0.000    0.000    0.000    0.000 data.py:106(_create_method)
    1    0.000    0.000    0.000    0.000 data.py:121(_minmax_mixin)
    1    0.000    0.000    0.000    0.000 data.py:22(_data_matrix)
    1    0.000    0.000    0.000    0.000 data.py:7(<module>)
    1    0.000    0.000    0.000    0.000 dia.py:1(<module>)
    1    0.000    0.000    0.000    0.000 dia.py:17(dia_matrix)
    1    0.000    0.000    0.000    0.000 dok.py:1(<module>)
    1    0.000    0.000    0.000    0.000 dok.py:29(dok_matrix)
    1    0.000    0.000    0.000    0.000 extract.py:2(<module>)
   49    0.000    0.000    0.001    0.000 fromnumeric.py:1887(any)
   49    0.005    0.000    0.006    0.000 function_base.py:605(asarray_chkfinite)
   49    0.000    0.000    0.000    0.000 getlimits.py:245(__init__)
   49    0.000    0.000    0.000    0.000 getlimits.py:270(max)
   49    0.000    0.000    0.002    0.000 lapack.py:405(get_lapack_funcs)
   49    0.002    0.000    0.003    0.000 lapack.py:447(_compute_lwork)
    1    0.000    0.000    0.000    0.000 lil.py:19(lil_matrix)
    1    0.002    0.002    0.002    0.002 lil.py:2(<module>)
   49    0.000    0.000    0.000    0.000 misc.py:169(_datacopied)
    3    0.000    0.000    0.000    0.000 nosetester.py:181(__init__)
    3    0.000    0.000    0.000    0.000 ntpath.py:174(split)
    3    0.000    0.000    0.000    0.000 ntpath.py:213(dirname)
    3    0.000    0.000    0.000    0.000 ntpath.py:96(splitdrive)
   49    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
   49    0.000    0.000    0.000    0.000 numeric.py:476(asanyarray)
   98    0.000    0.000    0.000    0.000 numerictypes.py:942(_can_coerce_all)
   49    0.000    0.000    0.000    0.000 numerictypes.py:964(find_common_type)
    5    0.000    0.000    0.000    0.000 re.py:138(match)
    2    0.000    0.000    0.000    0.000 re.py:143(search)
    7    0.000    0.000    0.000    0.000 re.py:230(_compile)
    1    0.000    0.000    0.000    0.000 sputils.py:2(<module>)
    1    0.000    0.000    0.000    0.000 sputils.py:227(IndexMixin)
    3    0.000    0.000    0.000    0.000 sre_compile.py:228(_compile_charset)
    3    0.000    0.000    0.000    0.000 sre_compile.py:256(_optimize_charset)
    3    0.000    0.000    0.000    0.000 sre_compile.py:433(_compile_info)
    6    0.000    0.000    0.000    0.000 sre_compile.py:546(isstring)
    3    0.000    0.000    0.000    0.000 sre_compile.py:552(_code)
    3    0.000    0.000    0.000    0.000 sre_compile.py:567(compile)
    3    0.000    0.000    0.000    0.000 sre_compile.py:64(_compile)
    7    0.000    0.000    0.000    0.000 sre_parse.py:149(append)
    3    0.000    0.000    0.000    0.000 sre_parse.py:151(getwidth)
    3    0.000    0.000    0.000    0.000 sre_parse.py:189(__init__)
   16    0.000    0.000    0.000    0.000 sre_parse.py:193(__next)
    3    0.000    0.000    0.000    0.000 sre_parse.py:206(match)
   13    0.000    0.000    0.000    0.000 sre_parse.py:212(get)
    3    0.000    0.000    0.000    0.000 sre_parse.py:268(_escape)
    3    0.000    0.000    0.000    0.000 sre_parse.py:317(_parse_sub)
    3    0.000    0.000    0.000    0.000 sre_parse.py:395(_parse)
    3    0.000    0.000    0.000    0.000 sre_parse.py:67(__init__)
    3    0.000    0.000    0.000    0.000 sre_parse.py:706(parse)
    3    0.000    0.000    0.000    0.000 sre_parse.py:92(__init__)
    1    0.000    0.000    0.000    0.000 utils.py:117(deprecate)
    1    0.000    0.000    0.000    0.000 utils.py:51(_set_function_name)
    1    0.000    0.000    0.000    0.000 utils.py:68(__init__)
    1    0.000    0.000    0.000    0.000 utils.py:73(__call__)
    3    0.000    0.000    0.000    0.000 {_sre.compile}
    1    0.000    0.000    0.000    0.000 {dir}
  343    0.000    0.000    0.000    0.000 {getattr}
    3    0.000    0.000    0.000    0.000 {hasattr}
  158    0.000    0.000    0.000    0.000 {isinstance}
  270    0.000    0.000    0.000    0.000 {len}
   49    0.000    0.000    0.001    0.000 {method 'all' of 'numpy.ndarray' objects}
   49    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
  211    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
   49    0.000    0.000    0.000    0.000 {method 'astype' of 'numpy.ndarray' objects}
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    5    0.000    0.000    0.000    0.000 {method 'end' of '_sre.SRE_Match' objects}
    6    0.000    0.000    0.000    0.000 {method 'extend' of 'list' objects}
    3    0.000    0.000    0.000    0.000 {method 'find' of 'bytearray' objects}
  205    0.000    0.000    0.000    0.000 {method 'get' of 'dict' objects}
    2    0.000    0.000    0.000    0.000 {method 'group' of '_sre.SRE_Match' objects}
   49    0.000    0.000    0.000    0.000 {method 'index' of 'list' objects}
    3    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
    1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
    5    0.000    0.000    0.000    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
   49    0.021    0.000    0.021    0.000 {method 'random_sample' of 'mtrand.RandomState' objects}
   98    0.001    0.000    0.001    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    3    0.000    0.000    0.000    0.000 {method 'replace' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
    2    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
   60    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
    1    0.000    0.000    0.000    0.000 {method 'update' of 'dict' objects}
    6    0.000    0.000    0.000    0.000 {min}
  147    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
   49    0.036    0.001    0.036    0.001 {numpy.core.multiarray.dot}
    4    0.000    0.000    0.000    0.000 {ord}
   18    0.000    0.000    0.000    0.000 {setattr}
    3    0.000    0.000    0.000    0.000 {sys._getframe}
   49    9.794    0.200    9.794    0.200 {time.sleep}
   99    0.000    0.000    0.000    0.000 {time.time}

Second Edit

I have implemented an equivalent C++ version (below). The C++ version does have the behaviour I was expecting: it uses only 0.3% to 0.5% of the CPU!

#include <iostream>
#include <chrono>
#include <random>
#include <thread>

// Tune these values to get a computation lasting from 2 to 10 ms
#define DIMS 50
#define MULTS 20

/*
This function performs MULTS matrix multiplications of transposed(A)*A.

We simply want to waste enough time doing computations (tuned to take between 2 ms and 10 ms).
*/
double compute_stuff(double A[][DIMS], double B[][DIMS]) {
    double res = 0.0;
    for (int k = 0; k < MULTS; k++) {
        for (int i = 0; i < DIMS; i++) {
            for (int j = 0; j < DIMS; j++) {
                B[i][j] = 0.0;
                for (int l = 0; l < DIMS; l++) {
                    B[i][j] += A[l][i] * A[l][j];  // (A^T A)[i][j] = sum over l of A[l][i]*A[l][j]
                }
            }
        }
        // We store the result from the matrix B
        for (int i = 0; i < DIMS; i++) {
            for (int j = 0; j < DIMS; j++) {
                A[i][j] = B[i][j];
            }
        }
    }
    for (int i = 0; i < DIMS; i++) {
        for (int j = 0; j < DIMS; j++) {
            res += A[i][j];
        }
    }
    return res;
}


int main() {
    std::cout << "Running main" << std::endl;
    double A[DIMS][DIMS];  // Data buffer for a random matrix
    double B[DIMS][DIMS];  // Data buffer for intermediate computations
    std::default_random_engine generator;
    std::normal_distribution<double> distribution(0.0, 1.0);
    for (int i = 0; i < DIMS; i++) {
        for (int j = 0; j < DIMS; j++) {
            A[i][j] = distribution(generator);
        }
    }
    bool keep_running = true;
    auto prev_time = std::chrono::high_resolution_clock::now();
    auto start_time = prev_time;
    while (keep_running)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        auto st = std::chrono::high_resolution_clock::now();
        double res = compute_stuff(A, B);
        auto et = std::chrono::high_resolution_clock::now();
        auto iteration_time = std::chrono::duration_cast<std::chrono::milliseconds>(st - prev_time).count();
        auto computation_time = std::chrono::duration_cast<std::chrono::milliseconds>(et - st).count();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(et - start_time).count();
        std::cout << "Time for the whole iteration, inc. sleep:" << iteration_time << " (ms), whereas the processing segment took " << computation_time  << "(ms)" << std::endl;
        keep_running = elapsed_time < 10 * 1000;
        prev_time = st;
    }
}

Here you can also see the output from the C++ equivalent program:

Time for the whole iteration, inc. sleep:314 (ms), whereas the processing segment took 7(ms)
Time for the whole iteration, inc. sleep:317 (ms), whereas the processing segment took 7(ms)
Time for the whole iteration, inc. sleep:316 (ms), whereas the processing segment took 8(ms)
Time for the whole iteration, inc. sleep:316 (ms), whereas the processing segment took 7(ms)
Time for the whole iteration, inc. sleep:314 (ms), whereas the processing segment took 10(ms)

There seems to be something Python-specific going on. The same behaviour has been confirmed on 3 machines (Linux and Windows).

kennethfm
  • What happens if you increase the `sleep` time to, say, 3 seconds, just to see whether CPU usage really stays high or not? – tobias_k Jan 30 '17 at 22:41
  • So how fast is your CPU running during the program's sleep? It might well be halted, meaning the program still dominates *run* time. – Yann Vernier Jan 30 '17 at 22:44
  • With 3 seconds `sleep` the CPU usage is 1.3%. With a 1 seconds `sleep`, it is 4.5%, with 0.5 seconds `sleep` it is 9%, and with 0.1 seconds is already 27%. – kennethfm Jan 30 '17 at 22:45
  • @kennethfm I guess now you have the hint you needed? – Moinuddin Quadri Jan 30 '17 at 22:47
  • @YannVernier , I can't really know how fast the CPU is running, as it is sleeping/waking up several times per second. I'm monitoring from the Task Manager which updates once per second or so. – kennethfm Jan 30 '17 at 22:48
  • @MoinuddinQuadri not just yet. Or maybe I'm missing something? – kennethfm Jan 30 '17 at 22:49
  • It will give you some context on the issue: [Efficient and fast Python While loop while using sleep()](http://stackoverflow.com/questions/10926328/efficient-and-fast-python-while-loop-while-using-sleep) – Moinuddin Quadri Jan 30 '17 at 22:56
  • Thanks @MoinuddinQuadri, I had run into that post before posting this. However, it is not an explanation, only an empirical evaluation of a toy example (from John La Rooy's answer)... which is obvious: the more you compute, the more CPU you use. In this case, the computation takes only 2ms, whereas the sleep takes 300ms. This means that per second there are only 6ms of actual CPU usage... the rest is sleeping. That does not justify 30% CPU usage. – kennethfm Jan 30 '17 at 23:03

4 Answers

7

I ran into this problem when I wrote a program for a game.

I realized that even an infinite while loop that does nothing but print a hello-world message still kept my program at 30% CPU usage.

So I added time.sleep(0.05) at the beginning and at the end of my while loop, and that solved my problem. Just experiment with sleeps in your loop; I think it can be done.
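If the goal is to keep a fixed computation frequency while still sleeping off the slack, one variant of this advice is to sleep only for whatever is left of each period (a sketch; run_at is a hypothetical helper name, not from the question):

```python
import time

def run_at(period, fn, duration=10.0):
    """Call fn once per `period` seconds, sleeping only for the remainder
    of each period so the loop frequency stays fixed regardless of fn's cost."""
    start = time.time()
    next_tick = start
    while time.time() - start < duration:
        fn()
        next_tick += period
        remaining = next_tick - time.time()
        if remaining > 0:  # if fn overran the period, skip the sleep entirely
            time.sleep(remaining)

run_at(0.3, lambda: None, duration=1.0)  # a few iterations of a no-op task
```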

1

I think you are measuring disparate things, which causes some confusion.

For starters, context switches have a cost; if you have a batch job, it might be better to let the system decide when to switch to other tasks than to insert sleeps yourself. Every time your process sleeps, it spends some time calling into the system to be rescheduled and to set an alarm for waking up, and then more time being restored once that alarm fires.

CPU usage indicators like the one in Task Manager are also traditionally imprecise. They are meant to show which program is keeping the system busy and to give an idea of what the scheduler is dealing with. For instance, one common sight is a System Idle Process that accounts for lots of time; that process exists only for consistency, so that going to sleep when there is nothing else to do is not a special case for the scheduler.

CPU speed itself is now variable. If your program takes frequent short sleeps, many computers will slow down to match it, a feature intended to let workloads like video playback avoid switching between run and sleep modes, since that switch itself takes time. In particular, once asleep, the CPU takes time to ramp up again, which complicates time-based scheduling (sleeps and timeouts) and delays reactions. This means a CPU percentage is only comparable to another one measured under a highly similar load.

Your system likely has a few other tasks running in the background that only occasionally need CPU time. Those might be slotted onto the same processor during short sleeps, but will more often run on another core if this task sleeps for longer. Since your program needs only a small portion of one processor's capacity, this makes the percentages vary wildly.

Another aspect is that the time measurement is only in milliseconds. With a work slice reported as anywhere from one to three milliseconds, the relative quantisation error is very large. The slices are too small to be measured reliably with either Task Manager or time.time() on this system.
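To illustrate the quantisation point: time.time() on Windows can tick in steps of a millisecond or more, whereas time.perf_counter() (Python 3.3+) uses the finest clock available. A quick sketch to compare the two on any machine:

```python
import time

def clock_step(clock):
    """Smallest observable non-zero increment of a clock, in seconds."""
    t0 = clock()
    t1 = clock()
    while t1 == t0:  # spin until the clock visibly advances
        t1 = clock()
    return t1 - t0

print('perf_counter step: %.9f s' % clock_step(time.perf_counter))
print('time.time step:    %.9f s' % clock_step(time.time))
```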

With all those additional variables in mind, all we really know is that the more sleeps you do, the more overhead the program has. A tool like Unix time(1) would show the distribution for a particular run by splitting it into wall (real time elapsed), user (time spent running the program itself) and system (time spent handling the program's calls, including the administrative overhead of sleep, but not the time actually slept).
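The same wall/user/system split that time(1) reports is also available from inside Python via os.times(); a small sketch:

```python
import os
import time

w0 = time.time()
t0 = os.times()  # fields: user, system, children_user, children_system, elapsed

time.sleep(0.2)                      # adds wall time but almost no CPU time
sum(i * i for i in range(200000))    # adds user time

t1 = os.times()
print('wall:   %.3f s' % (time.time() - w0))
print('user:   %.3f s' % (t1.user - t0.user))
print('system: %.3f s' % (t1.system - t0.system))
```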

What is the goal of these sleeps? Is that not served better by setting a thread priority?

Yann Vernier
  • Thanks for the long and detailed answer! This code is a minimal example of a video processing application, `compute_stuff` represent: `process_current_frame`, and thus to be called 30 times per second. The application is critical and should have close to zero latency. Therefore intended to have pretty high priority. Related to other points you raised: - Context switching might be costly, but, above it is made only ~3 times per second. - Indeed, I don't know how much to trust the task manager. In Linux I'm getting 120% (more than one core). I had to reduce dims to 12 to get 2ms computation. – kennethfm Jan 31 '17 at 10:14
  • Continuing... CPU variations is a very interesting point. However, I'm concerned that an end-user might not accept seeing an application take 30% of their CPU all the time -- You could indeed expect other tasks in the background (browser, word, etc) -- While quantisation error might be a problem, the ratio of sleep over computation is still around 100x, so quantisation might only shift it between 95x and 105x or so; the measurement, although inaccurate, might be sufficient to draw conclusions. – kennethfm Jan 31 '17 at 10:20
  • You're trying to count the percentage as run time over wall time. It may be your run time over cpu run time, completely disregarding sleep time. Sounds like your load should behave similar to e.g. Leap Motion or OSVR tracking processes. – Yann Vernier Jan 31 '17 at 11:11
  • True that I'm using wall time, but I would expect real run times are lower than wall time, and therefore even lower in terms of CPU usage (CPU usage for me is 100*run_time/total_time). In the (second) edit I've added, you can see what happens when the same framework is implemented in C++. The CPU usage goes much lower! As expected. What is going on in python then? Yes, the load should behave similar to systems with a constant processing of incoming data. – kennethfm Feb 01 '17 at 19:27
0

I had the same issue and managed to solve it by forcing numpy/scipy's BLAS to use only one thread. Define the number of threads as an environment variable, or add the following lines BEFORE importing numpy and scipy to force them to use a single thread.

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
import numpy as np
import scipy.linalg

I'm not sure of the underlying issue, but any sort of delay, such as time.sleep or a blocking I/O operation, seems to conflict with multithreaded numpy, causing it to waste CPU cycles rather than sleep properly. Operations such as np.mean, which are not multithreaded, do not appear to trigger the issue. Note, however, that when there is no delay, multithreaded numpy is still a lot faster at the computations than when it is forced onto one thread.
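These variables only take effect if they are set before NumPy's first import, because the BLAS thread pool is created at import time. One way to guarantee that without editing the script is a small launcher sketch ('cpuTests.py' here stands in for the script from the question):

```python
import os
import subprocess
import sys

# Run the target script with single-threaded BLAS. The variables are in the
# child's environment before its interpreter ever imports numpy.
env = dict(os.environ,
           OPENBLAS_NUM_THREADS='1',
           MKL_NUM_THREADS='1',
           OMP_NUM_THREADS='1')  # also covers OpenMP-backed BLAS builds

if os.path.exists('cpuTests.py'):  # hypothetical script name from the question
    subprocess.run([sys.executable, 'cpuTests.py'], env=env)
```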

0

I noticed the same thing: calling time.sleep from Python is a bit more CPU-intensive than it should be. I didn't run into performance problems as big as yours, but since I want my idling processes to actually idle rather than chew CPU, I went with monkey-patching time.sleep, replacing it with a native call to usleep. (My main target is Linux.)

import time
from ctypes import cdll

glibc = None


def _custom_sleep(t):
    # usleep expects microseconds; time.sleep takes seconds
    glibc.usleep(int(t * 1000000))


def patch_time():
    global glibc
    try:
        glibc = cdll.LoadLibrary("libc.so.6")

        time.sleep = _custom_sleep
    except Exception as e:
        print(f"Failed to patch time.sleep: {e}. Performance might be worse.")

I call patch_time() in the main app, and CPU usage dropped.

bogdan.mustiata