On modern multi-core platforms, the parallel performance of memory-bandwidth-bound applications often does not scale well with the number of cores. Typically, speedup is observed up to some number of cores, after which performance saturates. A synthetic example is the well-known STREAM benchmark, which is often used to report the achievable memory bandwidth, i.e., the memory bandwidth at the saturation point.
Consider the following results of the STREAM benchmark (Triad) on a single Xeon E5-2680 with a peak memory bandwidth of 42.7 GB/s (DDR3-1333):
1 core 16 GB/s
2 cores 30 GB/s
3+ cores 36 GB/s
STREAM scales well from 1 to 2 cores, but from 3 cores onward the performance is roughly constant.
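To make the numbers concrete, here is a rough sketch of the arithmetic behind them. It assumes the 42.7 GB/s peak comes from 4 memory channels, each 8 bytes wide, running at the nominal DDR3-1333 rate; the channel count and width are my assumption, not something I measured. The function names are made up for illustration.

```python
# STREAM Triad results quoted above, in GB/s, keyed by core count.
# The "3+ cores" entry is stored under 3.
measured_gbs = {1: 16.0, 2: 30.0, 3: 36.0}

# Assumptions (not measured): 4 channels, 8 bytes per transfer,
# nominal DDR3-1333 transfer rate.
CHANNELS = 4
BYTES_PER_TRANSFER = 8
TRANSFERS_PER_SEC = 1333e6


def peak_bandwidth_gbs(channels, bytes_per_transfer, transfers_per_sec):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return channels * bytes_per_transfer * transfers_per_sec / 1e9


def scaling_table(measured, peak):
    """Return (cores, bandwidth, speedup vs. 1 core, fraction of peak) rows."""
    base = measured[1]
    return [(cores, bw, bw / base, bw / peak)
            for cores, bw in sorted(measured.items())]


peak = peak_bandwidth_gbs(CHANNELS, BYTES_PER_TRANSFER, TRANSFERS_PER_SEC)
print(f"theoretical peak: {peak:.1f} GB/s")
for cores, bw, speedup, fraction in scaling_table(measured_gbs, peak):
    print(f"{cores} core(s): {bw:.0f} GB/s, "
          f"speedup {speedup:.2f}x, {fraction:.0%} of peak")
```

Under these assumptions, a single core reaches a bit less than 40% of the theoretical peak, and the saturated value is about 84% of it.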
My question is: what determines the memory bandwidth that a single CPU core can achieve? Since this question is too broad as stated, I will narrow it down to the above-mentioned architecture: how can I predict, from the specs of the E5-2680 or by looking at hardware counters, that STREAM with 1 thread will give me 16 GB/s?