13

So I know this isn't the recommended technique (preallocating is better), but I got really curious about this timing behavior; I'm curious what might be happening under the hood.

In my head, adding an element to an array could induce a couple different reasonable behaviors in memory depending on implementation: (1) amortized, it would take the same amount of time to add an element like in a linked list where you maintain a pointer to the last element, (2) it could take a big chunk of time now and then to preallocate enough memory for, say, twice as many elements as currently in the list (like a Java array), (3) something more clever than I can think of.

MATLAB seems to do something wonky that I don't quite grock. There seems to be a linear increase in cost with occasional spikes. Any guesses (or intelligent explanations) of what it might be doing? I averaged over simulations (which, I submit, might hide some interesting patterns).

This is what happens when you add one element to the end of an initially empty list iteratively. Why the linear increase? Is there a cool reason for those seemingly periodic spikes?

iterative time behavior

The code I used to generate it:

% for averaging over
num_averages = 100000;
% number of simulations
num_sims = 10000;
% the time it takes to add one more item, array
time_store = nan(num_sims, num_averages);

% averaging count
for i = 1:num_averages

    % an array that grows with every loop
    building_array = [];

    for j = 1:num_sims
        tic;
        building_array = [building_array 1];
        time_store(j, i) = toc;
    end
end

plot(mean(time_store, 2)); hold all;
xlabel('Element num'); ylabel('Time');

1 Answers1

15

This is really interesting! There are peaks at very regular intervals, and the curve is more or less flat in between. After each peak the line rises a bit. Neat! I think this is related to cache lines. You are measuring the cost of copying the array, which is presumably related to the cost of reading and writing cache lines.

If you replace the line

building_array = [building_array 1];

with

building_array(end+1) = 1;

then you won't be copying the data at every iteration loop, but actually appending an element to the array. With this change I get a mostly flat line, with peaks at logarithmically increasing distances (and of logarithmically increasing height too), which is consistent with the common-sense implementation of doubling array size when needed:

enter image description here

Just some more explanation: building_array = [building_array 1] creates a new array, one element larger than building_array, and copies building_array and 1 into it. This is then assigned to building_array. It seems that MATLAB's JIT has not yet been optimized for this code pattern!

Cris Luengo
  • 55,762
  • 10
  • 62
  • 120
  • 3
    It would be good if someone could explain the spikes. I understand the plateaus, but not the spikes in between. – Cris Luengo Jan 20 '18 at 15:54
  • Maybe MATLAB tries to store to contiguous blocks of mem? A challenge that would become increasingly difficult while reallocating for larger arrays (potentially causing the spikes) @ChrisLuengo – Peter Barrett Bryan Jan 21 '18 at 15:33
  • Hypothesis: You might also be seeing C malloc effects here. When extending an array with `(end+1)` (and there are no other refs to that array so it can do the in-place-modification optimization), even if it isn't doing padding itself, Matlab probably calls down to C's `realloc` to expand the raw data array, and if there's free malloc-managed space right after the end of the array, malloc can extend it in place. You get a spike when realloc can't expand in place and has to copy the entire array somewhere else to get a larger block of contiguous memory. – Andrew Janke Dec 10 '21 at 17:33
  • @AndrewJanke Interesting thought. We should be able to do an experiment with `malloc` to see if we get the same logarithmic behavior. The size duplication is *very* regular though. -- Oh, wait, are you referring to OP's graph? There's a copy there for each increment, it's just that at some specific sizes the copy operation takes longer than for nearby sizes. It's a very weird behavior! – Cris Luengo Dec 10 '21 at 17:37
  • No, I mean your graph in the answer. I'm no malloc expert, but it might have to do with how the malloc implementation organizes its free memory areas in to "bins" based on size. In a process where there's little other large contiguous memory allocation happening, the array gets to expand in place until it hits the bin size limit, and then it has to get moved to a larger bin (your spikes), and bins increase in size by powers of 8 or so. (At least in glibc, I think.) It'll look like "expand by doubling" vector behavior I think. No idea what's up w OP's graph. – Andrew Janke Dec 10 '21 at 18:14
  • Also, a MathWorker I know says that Matlab _does_ allocate extra space at the end of an array, `std::vector` style and not just relying on this malloc behavior, so `(end+1)` is adding to actually-already-allocated memory until you need to re-expand it when capacity is hit, but I have not been able to independently corroborate that. – Andrew Janke Dec 10 '21 at 18:54
  • 2
    @AndrewJanke It took me a while to find this, was using all the wrong keywords. But somehow I did remember that someone at MathWorks wrote a blog post about this. This optimization was introduced 10 years ago. https://blogs.mathworks.com/steve/2011/05/16/automatic-array-growth-gets-a-lot-faster-in-r2011a/ – Cris Luengo Dec 10 '21 at 19:14
  • @CrisLuengo: That is exactly the reference I've been looking for. Thank you! – Andrew Janke Dec 11 '21 at 04:50