
I wrote a very simple program that just sums two arrays, in both Fortran and Python. When I submit multiple (independent) jobs from the shell, there is a dramatic slow-down as soon as more than one of them runs at a time.

The Fortran version of my code is as follows:

program main
implicit none
real*8 begin, end, rand_val
integer i, j, k
integer,parameter::N_tiles = 20
integer,parameter::N_tilings = 100
integer,parameter::max_t_steps = 50
real*8,dimension(N_tiles*N_tilings,max_t_steps,5)::test_e, test_theta

! fill both arrays with random numbers
call random_seed()
do i = 1, N_tiles*N_tilings
  do j = 1, max_t_steps
    do k = 1, 5
      call random_number(rand_val)
      test_e(i, j, k) = rand_val
      call random_number(rand_val)
      test_theta(i, j, k) = rand_val
    end do
  end do
end do

! repeat the whole-array summation many times and time it
call CPU_TIME(begin)
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta + 0.5d0*test_e
  end do
end do
call CPU_TIME(end)

write(*, *) 'total time cost is : ', end-begin

end program main

and the shell script used to compile and launch it is as follows:

#!/bin/bash
gfortran -o result test.f90

nohup ./result &
nohup ./result &
nohup ./result &

As you can see, the main operation is the summation of the arrays test_theta and test_e. These arrays are not large (2000 × 50 × 5 doubles ≈ 4 MB each), and my computer has more than enough memory for this job. My workstation has 6 cores with 12 threads. I submitted 1, 2, 3, 4 and 5 jobs at a time from the shell, and the measured run times are as follows:

| #jobs   | 1  | 2  | 3   | 4   | 5   |
|---------|----|----|-----|-----|-----|
| time(s) | 21 | 31 | 161 | 237 | 357 |

I expected that, as long as the number of simultaneous jobs is smaller than the number of cores (6 on my machine), each job would take roughly the same time as a single job running alone. Instead, there is a dramatic slow-down.

The problem persists when I implement the same task in Python:

import numpy as np
import time

N_tiles = 20
N_tilings = 100
max_t_steps = 50
theta = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)
e = np.ones((N_tiles*N_tilings, max_t_steps, 5), dtype=np.float64)

# time.clock() is deprecated (and removed in Python 3.8);
# perf_counter() is the recommended wall-clock timer
begin = time.perf_counter()

# repeat the whole-array summation many times, as in the Fortran version
for i in range(1001):
    for j in range(50):
        theta += 0.5*e

end = time.perf_counter()
print('total time cost is {} s'.format(end - begin))

I don't know the reason, and I wonder whether it is related to the size of the CPU's L3 cache, i.e. whether the cache is too small for running several such jobs at once. Maybe it is also related to the so-called "false sharing" problem. How can I fix this?

This question is related to a former one, [dramatic slow down using multiprocess and numpy in python](https://stackoverflow.com/questions/47380366/dramatic-slow-down-using-multiprocess-and-numpy-in-python); here I just post a simple and typical example.

  • *"When I submit multiple (independent) jobs using shell, there will be dramatic slow-down when the number of threads is larger than one."* You mean you run multiple jobs at the same time? Why would you do that? I can't see any threading in your code. How do you make threads? Note that threads and processes are NOT the same thing. – Vladimir F Героям слава Nov 24 '17 at 14:12
  • Also, `CPU_TIME()` is completely inadequate for parallel computing. Discussed here many times. Use `system_clock()` instead. – Vladimir F Героям слава Nov 24 '17 at 14:14
  • There is no parallel computing such as `OpenMP` in the `fortran` code or `multiprocessing` in the `python` code. I just submit multiple jobs at the same time using the shell. When I submit 4 jobs, for example, and run the `top` command, I see 4 threads occupied, each with 100% CPU usage. – JunjieChen Nov 24 '17 at 14:41
  • Sorry, I don't really understand the concepts of threads and processes. What I want to do is use multiple cores of my computer, with each core handling one job. `CPU_TIME()` might be inadequate, just as you said, but the time cost is clearly much longer than I expected. – JunjieChen Nov 24 '17 at 14:43
  • In fact, this problem is related to the former question [multiprocess](https://stackoverflow.com/questions/47380366/dramatic-slow-down-using-multiprocess-and-numpy-in-python) and here I just post a simple and typical example. – JunjieChen Nov 24 '17 at 14:45
  • Isn't it because the hot loop between the CPU_TIME() calls is a very simple whole-array statement, so the calculation is limited by memory bandwidth? – roygvib Nov 24 '17 at 14:52
  • I think so. I guess you are not really using a multi-socket machine. There is no work for the CPU cores to do; they are just waiting for the memory. – Vladimir F Героям слава Nov 24 '17 at 14:53
  • BTW, if I make the array 50 times smaller, I get 0.41 s for one process and 0.38, 0.28, 0.24 and 0.24 s for four processes. – Vladimir F Героям слава Nov 24 '17 at 14:56
  • So it is just because the memory is slow, correct? Is there any way to ease this problem, for example by dividing the array into small pieces? – JunjieChen Nov 24 '17 at 14:56
  • It would be better to discuss that over some real code that you need to make faster. It depends on the details. – Vladimir F Героям слава Nov 24 '17 at 14:57
  • In the real code, the most time-consuming part is the summation of arrays of the same size as in this example. In fact, such summation takes up 90% of the total time. – JunjieChen Nov 24 '17 at 14:59
  • You can try threading instead of multiple processes. I can see a nice speed-up. – Vladimir F Героям слава Nov 24 '17 at 15:09
  • Though I am not sure at all, this kind of page might be related: https://stackoverflow.com/questions/16123970/most-efficient-way-to-weight-and-sum-a-number-of-matrices-in-fortran (another possibility is to use BLAS or MKL or some other library, but again I am not sure). – roygvib Nov 24 '17 at 15:14
  • With OpenMP I get a speed-up by a factor of almost 4 with 4 cores: 15.5 s vs. 3.9 s. – Vladimir F Героям слава Nov 24 '17 at 15:15
  • How do you use `OpenMP`? Do you mean parallelizing the outer loop `do i = 1, 1001`? I once used `omp` in my real code and it did not work. I shall try it again. Can you share your code with me by answering this question? – JunjieChen Nov 24 '17 at 15:32
  • Yes, I used `OpenMP` and I also get a speed-up for this example `fortran` code. I shall try to use it in the real code and see what happens. – JunjieChen Nov 24 '17 at 15:47

1 Answer


The code is likely slow when running multiple copies at once because all of them must move their data through the same limited-bandwidth memory bus.
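A rough back-of-the-envelope estimate supports this (assuming the compiler does not optimize the repetition loops away, so every pass streams the arrays through main memory): each execution of `test_theta = test_theta + 0.5d0*test_e` reads two 4 MB arrays and writes one back, and the statement runs 1001 × 50 times, so a single process moves roughly

$$1001 \times 50 \times 3 \times 4\,\mathrm{MB} \approx 600\,\mathrm{GB}$$

of data. At the measured 21 s for one job that is about 29 GB/s, which is already in the range of typical desktop memory bandwidth; additional processes can only share it.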

If you instead run just one process that works on one pair of arrays at a time and enable OpenMP threading, it can be made faster:

integer*8 :: begin, end, rate
...

! system_clock() measures wall-clock time, which is what matters here;
! CPU_TIME() would sum the time spent by all threads
call system_clock(count_rate=rate)
call system_clock(count=begin)

! distribute the repetitions of the outer loop over the threads
! (note: all threads update the same shared arrays here, which is fine
! for a throughput measurement but does not give reproducible values)
!$omp parallel do
do i = 1, 1001
  do j = 1, 50
    test_theta = test_theta+0.5d0*test_e
  end do
end do
!$omp end parallel do

call system_clock(count=end)
write(*, *) 'total time cost is : ', (end-begin)*1.d0/rate

On a quad-core CPU:

> gfortran -O3 testperformance.f90 -o result
> ./result 
 total time cost is :    15.135917384000001
> gfortran -O3 testperformance.f90 -fopenmp -o result
> ./result 
 total time cost is :    3.9464441830000001
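In real code one would usually parallelize the array operation itself rather than the repetition loop, so that each thread updates only its own contiguous part of the arrays and no two threads write to the same elements. A minimal sketch of that approach, assuming the same arrays and loop counts as in the question (`rep` is a new helper variable introduced here):

integer :: rep          ! helper variable for the timing repetitions

do rep = 1, 1001*50
  ! split the (k, j) slices across the threads; the innermost loop
  ! then runs over the contiguous first dimension of the arrays
  !$omp parallel do collapse(2)
  do k = 1, 5
    do j = 1, max_t_steps
      do i = 1, N_tiles*N_tilings
        test_theta(i, j, k) = test_theta(i, j, k) + 0.5d0*test_e(i, j, k)
      end do
    end do
  end do
  !$omp end parallel do
end do

Whether this scales beyond a few threads still depends on memory bandwidth: once the bus is saturated, more threads do not make a streaming summation any faster.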
  • I have another counterexample: moving the summation into a subroutine defined in a separate file and then calling it from the main file. If I run that version, `OpenMP` is much slower than the single-threaded one. I posted it in another question (https://stackoverflow.com/questions/47478641/slow-down-when-using-openmp-and-calling-subroutine-in-a-loop); you can see it for details if you have time. – JunjieChen Nov 24 '17 at 18:50