I have code which looks like this:

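import cv2
from multiprocessing.pool import ThreadPool
from tqdm import tqdm

# df is defined elsewhere and has a file_path column

def get_image_stats(fp):
    # Returns (height, width, height / width)
    img = cv2.imread(fp)
    return img.shape[0], img.shape[1], img.shape[0] / img.shape[1]

with ThreadPool(16) as pool:
    res = list(tqdm(pool.imap_unordered(get_image_stats, df.file_path), total=len(df)))

heights, widths, ars = list(zip(*res))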

The only library-specific part there is cv2.imread, which simply loads an image file into a numpy array, so it's I/O bound.

Why would my CPU usage look like this?

[image: CPU usage graph]

Notes on that image:

  • Horizontal axis is time in seconds, and vertical axis is CPU usage, ranging from 0% to 100%. The update interval is 1 second.
  • 40s is where I started the script
  • It's not easy to see, but there are 16 cores.

Another note: I did not set n_workers to 16 because I have 16 cores. Just a coincidence.

So why is this using up 75% of 16 cores at once?

Alexander Soare

1 Answer

Because your thread pool is going to use 1 core per thread if it can. That's what gives maximum parallelism and maximizes throughput.
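
A minimal sketch of how you could check this yourself, under some assumptions: it uses zlib.compress as a stand-in for cv2.imread, on the assumption that both spend their time in C code that releases the GIL while it works. The names compress_chunk and cpu_to_wall_ratio are made up for illustration. It compares the CPU time the process used against the wall-clock time that passed.

```python
import time
import zlib
from multiprocessing.pool import ThreadPool


def compress_chunk(data: bytes) -> int:
    # Stand-in workload (an assumption, not the original cv2.imread call):
    # zlib.compress runs in C and releases the GIL while compressing.
    return len(zlib.compress(data))


def cpu_to_wall_ratio(n_threads: int, n_tasks: int = 16, size: int = 1_000_000):
    """Run n_tasks compressions on a ThreadPool and return
    (CPU seconds / wall seconds, list of compressed sizes)."""
    payload = bytes(range(256)) * (size // 256)
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time summed over all threads
    with ThreadPool(n_threads) as pool:
        sizes = pool.map(compress_chunk, [payload] * n_tasks)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall, sizes


if __name__ == "__main__":
    for n in (1, 4, 16):
        ratio, _ = cpu_to_wall_ratio(n)
        print(f"{n:2d} threads: CPU/wall = {ratio:.2f}")
```

A ratio well above 1 would mean more than one core is busy at the same time, even though everything runs in a single process. The numbers depend on the machine and on how long each call holds the GIL, so treat it as a rough probe rather than a benchmark.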

Charlie Martin
  • Mind blown. I thought that's what `Pool` was for. I didn't realise `ThreadPool` would start more processes. Actually, let me step back a bit and explain my reasoning: 1) you said 1 core per thread, 2) I interpret that as multiple cores working at once, 3) I know about the GIL and know that can only work if there are multiple processes, 4) I thought that `Pool` is how you manage multiple processes, and I thought that `ThreadPool` is how you manage multiple threads within 1 process – Alexander Soare Apr 26 '21 at 18:40
  • Further to the above: https://stackoverflow.com/a/46049195/4391249. – Alexander Soare Apr 26 '21 at 18:46
  • Well, what would you expect? If it *doesn't* use more than one core, then the best you'll get is high load on one core. But you are getting high load on all 16 cores. QED. Try reducing the size of the pool to 8 and see what happens? – Charlie Martin Apr 26 '21 at 18:48
  • I get that. I suppose I need some time to reconfigure my understanding of multithreading in Python. Hope you can appreciate that. Will accept your answer shortly. – Alexander Soare Apr 26 '21 at 18:50
  • Oh sure, I didn't mean to smack you around. Accounting for the GIL is probably a big part of why it's not 95 percent of all cores. – Charlie Martin Apr 26 '21 at 18:54