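First, this is going to be I/O-bound from reading the data in. Second, the conversion itself is going to be memory-bound: if you read everything and then convert it in a separate pass, the data has already fallen out of cache by the time you touch it again. You'll get much better cache performance if you interleave the conversion with the reading.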
Pick some reasonable buffer size that's large enough for good I/O performance but small enough to fit in your cache, maybe 8-32 KB or so. Read in that much data, convert, and repeat.
For example:
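    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUFSIZE 16384

    /* file_id and num_elements come from your existing code;
       num_elements is assumed to be a non-negative element count. */
    uint8_t *buffer = malloc(BUFSIZE);
    float *s1 = malloc(num_elements * sizeof(float));
    size_t total_read = 0;
    size_t n;  /* fread returns size_t, not int */

    /* Read one cache-sized chunk, convert it while it's still hot, repeat. */
    while (total_read < num_elements && (n = fread(buffer, 1, BUFSIZE, file_id)) > 0)
    {
        /* Standard C has no min(); clamp the last chunk by hand. */
        if (n > num_elements - total_read)
            n = num_elements - total_read;
        for (size_t i = 0; i < n; i++)
            s1[total_read + i] = (float)buffer[i];
        total_read += n;
    }
    free(buffer);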
You might also see improved performance by using SIMD operations to convert multiple items at once. However, the total performance will still be bottlenecked by the I/O from fread, so how much improvement you might see from SIMD will be questionable.
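As a rough sketch of what the SIMD route could look like, here's a version of the inner conversion loop using SSE2 intrinsics, assuming you're on x86 and your compiler defines `__SSE2__`. The function name `convert_u8_to_float` is made up for illustration; on other targets it just falls back to the plain scalar loop. It widens 16 bytes at a time to 32-bit integers by interleaving with zeros, then converts those to floats.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: convert n bytes from src into n floats in dst. */
static void convert_u8_to_float(const uint8_t *src, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    for (; i + 16 <= n; i += 16) {
        /* Load 16 bytes (unaligned load, so src needs no special alignment). */
        __m128i bytes = _mm_loadu_si128((const __m128i *)(src + i));
        /* Zero-extend 8-bit -> 16-bit by interleaving with zero bytes. */
        __m128i lo16 = _mm_unpacklo_epi8(bytes, zero);
        __m128i hi16 = _mm_unpackhi_epi8(bytes, zero);
        /* Zero-extend 16-bit -> 32-bit, then convert each group of 4 to float. */
        _mm_storeu_ps(dst + i,      _mm_cvtepi32_ps(_mm_unpacklo_epi16(lo16, zero)));
        _mm_storeu_ps(dst + i + 4,  _mm_cvtepi32_ps(_mm_unpackhi_epi16(lo16, zero)));
        _mm_storeu_ps(dst + i + 8,  _mm_cvtepi32_ps(_mm_unpacklo_epi16(hi16, zero)));
        _mm_storeu_ps(dst + i + 12, _mm_cvtepi32_ps(_mm_unpackhi_epi16(hi16, zero)));
    }
#endif
    /* Scalar tail (or the whole array when SSE2 isn't available). */
    for (; i < n; i++)
        dst[i] = (float)src[i];
}
```

You'd call it as `convert_u8_to_float(buffer, s1 + total_read, n);` in place of the inner `for` loop. A decent optimizing compiler may well auto-vectorize the scalar loop into something similar, so check the generated code before hand-rolling this.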
Since you're converting a large number of uint8_t values, it's also possible you could get some improved performance by using a lookup table instead of doing the integer-to-floating-point conversion. You'd only need a lookup table of 256 float values (1 KB), which easily fits in cache. I don't know if that would be faster or not, so you should definitely profile the code to figure out what the best option is.
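If you want to try the lookup table idea, a minimal sketch might look like this. The names `lut`, `init_lut`, and `convert_with_lut` are just placeholders I've picked for the example, and it assumes `float` is 4 bytes so that the table comes to 1 KB.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry per possible uint8_t value: 256 floats. */
static float lut[256];

/* Fill the table once, before any conversion. */
static void init_lut(void)
{
    for (int i = 0; i < 256; i++)
        lut[i] = (float)i;
}

/* Convert n bytes by table lookup instead of an int-to-float instruction. */
static void convert_with_lut(const uint8_t *src, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lut[src[i]];
}
```

Call `init_lut()` once at startup, then use `convert_with_lut(buffer, s1 + total_read, n);` in place of the inner `for` loop and time both variants on your real data.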