I have a list of function pointers called tasks_ready_master. The pointers point to functions (tasks) defined in a separate module. I want to execute them in parallel using threads. Each thread has a queue called "thread_queue" of capacity 1, which holds the task the thread should execute next. Once the task is done, it is removed from the queue. There is also a queue holding all the tasks (called "master_queue"). This is my implementation of the execution subroutine:

subroutine master_worker_execution(self,var,tasks_ready_master,first_task,last_task)

type(tcb),dimension(20)::tasks_ready_master !< the master array of tasks 
integer::i_task !< the task counter 
type(tcb)::self !< self
integer,intent(in)::first_task,last_task 
type(variables),intent(inout)::var !< the variables
!OpenMP variables
integer::num_thread !< the rank of the thread
integer:: OMP_GET_THREAD_NUM !< function to get the rank of the thread
type(QUEUE_STRUCT),pointer:: thread_queue
type(QUEUE_STRUCT),pointer::master_queue
logical::success
integer(kind = OMP_lock_kind) :: lck !< a lock

call OMP_init_lock(lck) !< lock initialization 
!$OMP PARALLEL PRIVATE(i_task,num_thread,thread_queue) &
!$OMP SHARED(tasks_ready_master,self,var,master_queue,lck)
num_thread=OMP_GET_THREAD_NUM() !< the rank of the thread 

!$OMP MASTER
call queue_create(master_queue,last_task-first_task+1)   !< create the master queue 

do i_task=first_task,last_task
   call queue_append_data(master_queue,tasks_ready_master(i_task),success)   !< add the list elements to the queue (full queue)
end do
!$OMP END MASTER 
!$OMP BARRIER

if (num_thread  .ne. 0) then 
   do while (.not. queue_empty(master_queue))  !< if the queue is not empty
      call queue_create(thread_queue,1) !< create a thread queue of capacity 1
      call OMP_set_lock(lck)  !< set the lock 
      call queue_append_data(thread_queue,master_queue%data(1),success) !< add the first element of the list to the thread queue
      call queue_retrieve_data(master_queue) !< retire the first element of the master queue
      call OMP_unset_lock(lck) !< unset the lock 
      call thread_queue%data(1)%f_ptr(self,var) !< execute the one and only element of the thread queue
      call queue_retrieve_data(thread_queue) !< retire the element 
   end do
end if
!$OMP MASTER
call queue_destroy(master_queue)  !< destroy the master queue 
!$OMP END MASTER

call queue_destroy(thread_queue)  !< destroy the thread queue 
!$OMP END PARALLEL
call OMP_destroy_lock(lck)   !< destroy the lock 

end subroutine master_worker_execution

The problem is that I get a segmentation fault:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f30fd3ca700 in ???
#0  0x7f30fd3ca700 in ???
#1  0x7f30fd3c98a5 in ???
#1  0x7f30fd3c98a5 in ???
#2  0x7f30fd06920f in ???
#2  0x7f30fd06920f in ???
#3  0x56524a0f1d08 in __master_worker_MOD_master_worker_execution._omp_fn.0
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/master_worker.f90:70
#4  0x7f30fd230a85 in ???
#3  0x56524a0f1ad7 in __queue_MOD_queue_destroy
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/queue.f90:64
#4  0x56524a0f1d94 in __master_worker_MOD_master_worker_execution._omp_fn.0
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/master_worker.f90:81
#5  0x7f30fd227e75 in ???
#6  0x56524a0f1f68 in __master_worker_MOD_master_worker_execution
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/master_worker.f90:54
#7  0x56524a0f29b5 in __app_management_MOD_management
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/app_management_without_t.f90:126
#8  0x56524a0f579b in hecese
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/program_hecese.f90:398
#9  0x56524a0ed26e in main
    at /home/hakim/stage_hecese_HPC/OpenMP/hecese_OMP/program_hecese.f90:13
Erreur de segmentation (core dumped)

If I remove the while loop, it works (no segfault). I don't understand where the mistake comes from. When debugging with gdb, it points me to the lines that call queue_append_data and queue_retrieve_data.

This is the output I get when I use valgrind:

==13100== Memcheck, a memory error detector
==13100== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13100== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==13100== Command: ./output_hecese_omp
==13100== 
==13100== Thread 3:
==13100== Jump to the invalid address stated on the next line
==13100==    at 0x0: ???
==13100==    by 0x10EB64: __master_worker_MOD_master_worker_execution._omp_fn.0 (master_worker.f90:73)
==13100==    by 0x4C8BA85: ??? (in /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0)
==13100==    by 0x4F1D608: start_thread (pthread_create.c:477)
==13100==    by 0x4DD7292: clone (clone.S:95)
==13100==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==13100== 

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x4888700 in ???
#1  0x48878a5 in ???
#2  0x4cfb20f in ???
#3  0x0 in ???
==13100== 
==13100== Process terminating with default action of signal 11 (SIGSEGV)
==13100==    at 0x4CFB169: raise (raise.c:46)
==13100==    by 0x4CFB20F: ??? (in /usr/lib/x86_64-linux-gnu/libc-2.31.so)
==13100== 
==13100== HEAP SUMMARY:
==13100==     in use at exit: 266,372 bytes in 121 blocks
==13100==   total heap usage: 194 allocs, 73 frees, 332,964 bytes allocated
==13100== 
==13100== LEAK SUMMARY:
==13100==    definitely lost: 29,280 bytes in 3 blocks
==13100==    indirectly lost: 2,416 bytes in 2 blocks
==13100==      possibly lost: 912 bytes in 3 blocks
==13100==    still reachable: 233,764 bytes in 113 blocks
==13100==         suppressed: 0 bytes in 0 blocks
==13100== Rerun with --leak-check=full to see details of leaked memory
==13100== 
==13100== For lists of detected and suppressed errors, rerun with: -s
==13100== ERROR SUMMARY: 3 errors from 1 contexts (suppressed: 0 from 0)
hakim
  • You only showed part of the file that master_worker_execution appears to belong to, so which line is line 70 here? Did you also compile with bounds checking etc.? – albert Aug 16 '21 at 09:41
  • Wild guess: Might be that `call queue_destroy(thread_queue)` has to be "protected" like `call queue_destroy(master_queue)` – albert Aug 16 '21 at 09:43
  • @albert I tried using `-fcheck=all` and got: `Fortran runtime error: Recursive call to nonrecursive procedure 'queue_create' `. How do I fix it, please? As for your second comment, I don't think so, since it doesn't work even without destroying the queues. – hakim Aug 16 '21 at 09:51
  • @albert I edited my question so everybody can see the valgrind output. – hakim Aug 16 '21 at 10:04
  • The message `Fortran runtime error: Recursive call to nonrecursive procedure 'queue_create' ` gives a clue about the setup of the queues (but unfortunately not to me as I don't know `omp`, that is why I also added "wild guess"). Though the question about the line number is not answered, there are probably some lines above the `subroutine master_worker_execution(self,var,tasks_ready_master,first_task,last_task)` so that it is not line 1 in the original file, what is the original line number? – albert Aug 16 '21 at 10:23
  • @albert Sorry for not answering your question! Line 70 is `call queue_append_data(thread_queue,master_queue%data(1),success) !< add the first element of the list to the thread queue`. I can add that when I comment out `call thread_queue%data(1)%f_ptr(self,var)` it works well, but when I leave it in, I get a segfault. I hope that is clear. – hakim Aug 16 '21 at 10:31
  • Please provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), otherwise it is all wild guesses. From what you have provided so far *nobody* but you can tell what line 70 in master_worker.f90 is, so it is all but impossible to do anything else. As for recursive procedures: before Fortran 2018, all recursive procedures had to be declared as such. Your Fortran book will cover how to do this. See https://stackoverflow.com/questions/31756906/recursive-fortran-function-return-array for an example. – Ian Bush Aug 16 '21 at 10:32
  • If it "works well" when you comment out the offending line, then why not simply remove the line! – steve Aug 16 '21 at 17:01
  • @steve because if I remove it, I can't execute all the tasks but only a limited number (= number of workers) .. The work will be done partially .. We will leave some tasks not executed in the queue .. – hakim Aug 18 '21 at 12:32
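
The master/worker scheme described in the question (workers repeatedly take one task from a shared list and run it through a procedure pointer) can be sketched in a self-contained form as below. This is only a minimal illustration, not the asker's queue module: the names `task_demo`, `task_t`, `task_proc`, `square_task`, `next_task` and `my_task` are hypothetical, and a plain array with a shared index stands in for `master_queue`. The point it shows is that the emptiness test and the removal of a task happen inside the same critical region, so a thread can never pick up an entry that another thread has already taken. Whether a gap between the two steps is what causes the crash in the question cannot be determined from the excerpt shown.

```fortran
! Minimal sketch: workers pull tasks from a shared list until it is empty.
! All names (task_demo, task_t, task_proc, square_task, next_task, my_task)
! are hypothetical stand-ins, not part of the code in the question.
module task_demo
   implicit none
   abstract interface
      subroutine task_proc(id, results)
         integer, intent(in)    :: id
         integer, intent(inout) :: results(:)
      end subroutine task_proc
   end interface
   type :: task_t
      procedure(task_proc), pointer, nopass :: f_ptr => null()
   end type task_t
contains
   subroutine square_task(id, results)
      integer, intent(in)    :: id
      integer, intent(inout) :: results(:)
      results(id) = id*id      ! each task writes only its own slot
   end subroutine square_task
end module task_demo

program master_worker_sketch
   use task_demo
   implicit none
   integer, parameter :: n_tasks = 8
   type(task_t) :: tasks(n_tasks)
   integer :: results(n_tasks)
   integer :: next_task, my_task, i

   do i = 1, n_tasks
      tasks(i)%f_ptr => square_task
   end do
   results = 0
   next_task = 1               ! index of the first task not yet handed out

   !$omp parallel default(none) shared(tasks, results, next_task) private(my_task)
   do
      ! Test for "empty" and take the task in the same critical region,
      ! so no other thread can drain the list between the two steps.
      !$omp critical (dequeue)
      if (next_task <= n_tasks) then
         my_task = next_task
         next_task = next_task + 1
      else
         my_task = 0           ! nothing left for this thread
      end if
      !$omp end critical (dequeue)

      if (my_task == 0) exit
      call tasks(my_task)%f_ptr(my_task, results)
   end do
   !$omp end parallel

   print *, results
end program master_worker_sketch
```

Built with, for example, `gfortran -fopenmp`, the program should print the squares of 1 through 8 regardless of the number of threads. Unlike the subroutine in the question, every thread takes part in the work here, and no per-thread queue is created or destroyed inside the loop.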

0 Answers