I am learning cache operations with regard to spatial locality. (My references so far are Principles of Parallel Programming by Lin and Snyder, this tutorial, and of course Wikipedia.)
Take the following example, compiled with gcc and run on Windows 7 Professional on an Intel Core 2 Duo CPU (L7500).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int *array;
    int length;
    int count;
    int range;
    int i;

    // generate an array of a million integers between 0 and 99
    length = 1000000;
    range = 100;
    array = calloc(length, sizeof(int));
    srand(time(NULL));
    for (i = 0; i < length; i++)
    {
        array[i] = rand() % range;
        // printf("%d\n", array[i]);
    }

    // count the number of occurrences of 3 in the array
    count = 0;
    for (i = 0; i < length; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }
    printf("count = %6d\n", count);

    return 0;
}
Now in the second half of the routine, the entire array is going to be read, so by spatial locality the CPU should bring it into the cache ahead of use. But how much of the array can/does/should it load into the cache at any one time during the loop? One cache line at a time (64 bytes / 4 bytes per int = 16 integers), larger blocks of it, or the entire array in one fell swoop?
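To make that concrete, here is the sort of probe I had in mind (just a rough sketch, not a careful benchmark: it assumes a 64-byte line, uses the coarse clock() timer, and repeats each sweep so the times are measurable). If the full sweep and the stride-of-16 sweep take roughly the same time, I would read that as the traffic being dominated by whole-line fills rather than by the number of integers actually touched.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LEN  1000000
#define REPS 200

int main(void)
{
    int *a = calloc(LEN, sizeof(int));
    volatile long sink = 0;   // volatile so the loops are not optimized away
    clock_t t0;
    int r, i;

    // sweep 1: touch every int
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < LEN; i++)
            sink += a[i];
    printf("every int:      %ld ticks\n", (long)(clock() - t0));

    // sweep 2: touch one int per assumed 64-byte cache line (16 ints)
    t0 = clock();
    for (r = 0; r < REPS; r++)
        for (i = 0; i < LEN; i += 16)
            sink += a[i];
    printf("every 16th int: %ld ticks\n", (long)(clock() - t0));

    free(a);
    return 0;
}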
Also, from what I understand, the latency of loading data from RAM into the cache (or, in the textbook's terms, from non-local to local memory) can be much more significant than the time needed to actually execute the loop itself. True?
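And here is roughly how I planned to check that, as a fragment dropped into main() right after the fill loop, with a clock_t t0 added to the declarations at the top. Again only a sketch: clock() is coarse, the hardware prefetcher may hide some latency on a sequential scan, and since the 4 MB array is about the size of this CPU's L2 (if I have the spec right), the "warm" pass may not be fully cached.

    // first (cold) pass over the array: data has to stream in from RAM
    t0 = clock();
    count = 0;
    for (i = 0; i < length; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }
    printf("cold pass: %ld ticks, count = %6d\n", (long)(clock() - t0), count);

    // second (warm) pass: reuses whatever is still resident in the cache
    t0 = clock();
    count = 0;
    for (i = 0; i < length; i++)
    {
        if (array[i] == 3)
        {
            count++;
        }
    }
    printf("warm pass: %ld ticks, count = %6d\n", (long)(clock() - t0), count);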
Now say we move this code to a multiprocessor/multicore machine, and the counting portion is changed to run in 4, 8, 16, etc. parallel threads (using pthreads), each counting a separate portion of the array, with the private counts added together at the end. Could this incur the RAM-to-cache transfer latency multiple separate times, once per thread, making the parallel version run slower than the serial one?
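For reference, this is the shape of the parallel version I have in mind (a minimal sketch, not tuned: 4 threads hard-coded, the array made global for simplicity, the chunking assumes the length divides evenly by the thread count, and names like count_threes and struct task are just placeholders of mine; compile with -pthread).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define LENGTH   1000000
#define NTHREADS 4

static int array[LENGTH];

struct task {
    int start;   // first index this thread scans
    int end;     // one past the last index it scans
    int count;   // private count, merged by main()
};

static void *count_threes(void *arg)
{
    struct task *t = arg;
    int i, local = 0;        // accumulate locally, write back once
    for (i = t->start; i < t->end; i++)
        if (array[i] == 3)
            local++;
    t->count = local;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    int i, total = 0;
    int chunk = LENGTH / NTHREADS;

    srand(time(NULL));
    for (i = 0; i < LENGTH; i++)
        array[i] = rand() % 100;

    for (i = 0; i < NTHREADS; i++) {
        tasks[i].start = i * chunk;
        tasks[i].end   = (i + 1) * chunk;
        tasks[i].count = 0;
        pthread_create(&tid[i], NULL, count_threes, &tasks[i]);
    }
    for (i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += tasks[i].count;   // merge the private counts
    }

    printf("count = %6d\n", total);
    return 0;
}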