As far as I understand, a sampling profiler works as follows: it interrupts program execution at regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter for that part. In a post-processing step, the share of the total execution time attributable to each function is computed from the counter C for that function and the total number of samples N:
ratio of the function = C / N
Finding the hotspots is then easy, as these are the parts of the program with a high ratio.
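To make my understanding of the single-threaded case concrete, here is a minimal sketch of the post-processing step. The sample list and the function names in it (`parse`, `compute`, `io`) are made up for illustration, and the helper `hotspot_ratios` is hypothetical, not taken from any real profiler:

```python
from collections import Counter

def hotspot_ratios(samples):
    """Return C / N per function, given one function name per sampling interrupt."""
    counts = Counter(samples)   # C for each function
    total = len(samples)        # N, the total number of samples
    return {fn: c / total for fn, c in counts.items()}

# Hypothetical samples: the function found executing at each interrupt.
samples = ["parse", "compute", "compute", "compute",
           "io", "compute", "parse", "compute"]

ratios = hotspot_ratios(samples)

# List functions from hottest to coldest.
for fn, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fn}: {ratio:.3f}")
```

Here every interrupt contributes exactly one sample, so the ratios sum to 1.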
But how can this be done for a parallel program running on parallel hardware? As far as I know, when program execution is interrupted, the currently executing parts of the program on ALL processors are determined. Because of that, a function that is executing in parallel gets counted multiple times per interrupt. Thus the sample count C of this function can no longer be used to compute its share of the whole execution time.
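A small sketch of what I mean, assuming (this is my assumption, not something I know about any particular profiler) that each interrupt records one function per processor. The four-processor trace and the names `work`, `merge`, and `idle` are invented for illustration:

```python
from collections import Counter

def count_parallel_samples(ticks):
    """Count samples per function when each interrupt yields one sample per processor."""
    counts = Counter()
    for per_cpu in ticks:
        counts.update(per_cpu)  # one increment per processor running the function
    return counts

# Hypothetical trace: 4 interrupts on 4 processors.
ticks = [
    ("work", "work", "work", "work"),
    ("work", "work", "work", "work"),
    ("merge", "idle", "idle", "idle"),
    ("merge", "idle", "idle", "idle"),
]

counts = count_parallel_samples(ticks)
n_ticks = len(ticks)  # N, the number of interrupts

# Applying ratio = C / N as in the sequential case.
naive = {fn: c / n_ticks for fn, c in counts.items()}
print(naive)
```

In this trace `work` and `merge` are each running during half of the interrupts, yet C / N gives 2.0 for `work` and 0.5 for `merge`, so the value is no longer a fraction of the execution time.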
Is my thinking correct? Are there other ways to identify the hotspots of a parallel program, or is this just not possible using sampling?