I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in at most some seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input, but in all usage is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s);
line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there're better ways to do this, but being that I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. New code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. Its a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.