
I use Python 3 to do some encrypted computation with Microsoft SEAL and am looking for performance improvements. My approach is:

  1. Create a shared-memory block to hold the plaintext data (use a NumPy array in shared memory for multiprocessing); see the sketch after this list.
  2. Start multiple processes with `multiprocessing.Process` (a parameter controls the number of processes, thus limiting CPU usage).
  3. Processes read from shared memory and do some encrypted calculation.
  4. Wait for the calculations to finish and join the processes.
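
A stripped-down sketch of this setup (the worker body is only a stand-in for the real SEAL calculation; names such as `worker` and the array shape are made up for illustration):

import numpy as np
from multiprocessing import Process, Queue, shared_memory

def worker(shm_name, shape, dtype, result_queue):
    # attach to the existing shared-memory block and view it as a numpy array
    shm = shared_memory.SharedMemory(name=shm_name)
    plaintext = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    # ... the encrypted calculation on `plaintext` would go here ...
    result_queue.put(float(plaintext.sum()))    # placeholder result
    shm.close()

if __name__ == '__main__':
    data = np.random.rand(1024, 1024)
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    shared = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    shared[:] = data                            # copy the plaintext into shared memory

    num_processes = 13                          # the parameter limiting CPU usage
    result_queue = Queue()
    procs = [Process(target=worker, args=(shm.name, data.shape, data.dtype, result_queue))
             for _ in range(num_processes)]
    for p in procs:
        p.start()
    results = [result_queue.get() for _ in procs]
    for p in procs:
        p.join()

    shm.close()
    shm.unlink()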

I run this program on a 32-vCPU / 64 GB (32U64G) x86 Linux server; the CPU model is Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz.

I notice that when I double the number of processes, the run time only improves by about 20%. I've tried three process counts:

| process count | 7   | 13 | 27  |
|---------------|-----|----|-----|
| time ratio    | 0.8 | 1  | 1.2 |

Why is the improvement disproportionate to the resources used (CPU and memory)? Conceptual explanations or specific Linux command lines are both welcome. Thanks.

FYI:

My subprocess code looks like this:

import time
import seal

def sub_process_main(encrypted_bytes, plaintext_array, result_queue):
    # init

    # status_sign
    while shared_int > 0:
        # SEAL load and some other calculation
        encrypted_matrix_list = seal.Ciphertext.load(encrypted_bytes)
        shared_plaintext_matrix = seal.Encoder.encode(plaintext_array)

        # ... do something

        for _ in some_loop:
            time1 = time.time()
            res = []
            for i in range(len(encrypted_matrix_list)):
                enc = seal.evaluator.multiply_plain(encrypted_matrix_list[i],
                                                    shared_plaintext_matrix[i])
                res.append(enc)
            time2 = time.time()
            print(f'time usage: {time2 - time1}')

        # ... do something

        result_queue.put(final_result)
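
`shared_int` in the snippet above is a status sign that the workers poll; it is presumably created in the parent process and shared with the workers, and it does not appear in the function signature. A minimal sketch, with hypothetical names and the SEAL work elided, of how such a flag and the result queue could be wired up:

from multiprocessing import Process, Queue, Value

def sub_process_main(shared_int, result_queue):
    # keep working while the parent keeps the status sign positive
    while shared_int.value > 0:
        # ... SEAL load and encrypted calculation would happen here ...
        result_queue.put('final_result')    # placeholder for the real result
        break

if __name__ == '__main__':
    shared_int = Value('i', 1)              # shared status sign
    result_queue = Queue()
    workers = [Process(target=sub_process_main, args=(shared_int, result_queue))
               for _ in range(13)]
    for w in workers:
        w.start()
    results = [result_queue.get() for _ in workers]
    shared_int.value = 0                    # tell any still-running worker to stop
    for w in workers:
        w.join()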

I actually print the time for every part of my code; here is the time cost for this part:

| process count  |        13 |        27 |
|----------------|-----------|-----------|
| occurrences    |      1791 |       864 |
| total time (s) | 1698.2140 | 1162.8330 |
| average (s)    |    0.9482 |    1.3459 |
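
For reference, the averages are just total time divided by occurrences (1698.2140 / 1791 ≈ 0.9482 s, 1162.8330 / 864 ≈ 1.3459 s), so each individual call is noticeably slower with 27 processes even though the total time for this part drops.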

I've monitored some metrics, but I don't know whether any of them are abnormal.

13 cores:

  • top: screenshot `13_core_top`
  • pidstat: screenshot `13_core_pidstat`
  • vmstat: screenshot `13_core_vmstat`

27 cores:

  • top: screenshot `27_core_top` (Why is this using all cores rather than exactly 27 cores? Does it have anything to do with Hyper-Threading?)
  • pidstat: screenshot `27_core_pidstat`
  • vmstat: screenshot `27_core_vmstat`

  • There are more factors to consider than just the number of processes you've created and the number of cores the CPU has. If you're running this on a modern OS like Linux, the kernel's scheduler decides when each task runs on the CPU, even if you have a sufficient number of cores for the number of processes. Beyond that, there are probably implementation details in your hardware that you need to consider. On the other hand, it could be a simple case of how you solved the problem. There aren't any details about how you perform the encrypted calculation. Your question is complex. – Thomas Dignan Jan 29 '22 at 08:57
  • Why processes and not threads? `processes read from shared memory` In what way and what synchronization do they use? `specific linux cmdlines` perf. A lot of perf – KamilCuk Jan 29 '22 at 09:02
  • @ThomasDignan I've posted my code. I'm measuring the time cost, within a single subprocess, of nearly a single SEAL API call, and the time costs are very different. – Nelofie Jan 29 '22 at 09:38
  • @KamilCuk I read that Python threads are restricted by the Python GIL, so multithreading has very poor performance. – Nelofie Jan 29 '22 at 09:42
  • Make sure you don't have any process idly wasting time while it waits to read the shared memory. – Thomas Dignan Jan 29 '22 at 11:12
  • Can you run [`lstopo`](https://linux.die.net/man/1/lstopo) on the machine and report the result (alternatively `hwloc-ls` if the `lstopo` result is small or does not work)? – Jérôme Richard Jan 29 '22 at 11:24

0 Answers