I'm trying to remove outliers from a tick data series, following Brownlees & Gallo 2006 (if you may be interested).
The code works fine but given that I'm working on really long vectors (the biggest has 20m observations and after 20h it was not done computing) I was wondering how to speed it up.
What I did until now is:
I changed the time and date format to numeric double and I saw that it saves quite some time in processing and A LOT OF MEMORY.
I allocated memory for the vectors:
[n] = size(price);
x = price;
score = nan(n,'double'); %using tic and toc I saw that nan requires less time than zeros
trimmed_mean = nan(n,'double');
sd = nan(n,'double');
out_mat = nan(n,'double');
Here is the loop I'd love to remove. I read that vectorizing would speed up a lot, especially using long vectors.
for i = k+1:n
trimmed_mean(i) = trimmean(x(i-k:i-1 & i+1:i+k),10,'round'); %trimmed mean computed on the 'k' closest observations to 'i' (i is excluded)
score(i) = x(i) - trimmed_mean(i);
sd(i) = std(x(i-k:i-1 & i+1:i+k)); %same as the mean
tmp = abs(score(i)) > (alpha .* sd(i) + gamma);
out_mat(i) = tmp*1;
end
Here is what I was trying to do
trimmed_mean=trimmean(regroup_matrix,10,'round',2);
score=bsxfun(@minus,x,trimmed_mean);
sd=std(regroup_matrix,2);
temp = abs(score) > (alpha .* sd + gamma);
out_mat = temp*1;
But given that I'm totally new to Matlab, I don't know how to properly construct the matrix of neighbouring observations. I just think it should be shaped like: regroup_matrix= nan (n,2*k)
.
EDIT: To be specific, what I am trying to do (and I am not able to) is:
Given a column vector "x" (n,1) for each observation "i" in "x" I want to take the "k" neighbouring observations to "i" (from i-k to i-1 and from i+1 to i+k) and put these observations as rows of a matrix (n, 2*k).
EDIT 2: I made a few changes to the code and I think I am getting closer to the solution. I posted another question specific to what I think is the problem now:
Matlab: Filling up matrix rows using moving intervals from a column vector without a for loop
What I am trying to do now is:
[n] = size(price,1);
x = price;
[j1]=find(x);
matrix_left=zeros(n, k,'double');
matrix_right=zeros(n, k,'double');
toc
matrix_left(j1(k+1:end),:)=x(j1-k:j1-1);
matrix_right(j1(1:end-k),:)=x(j1+1:j1+k);
matrix_group=[matrix_left matrix_right];
trimmed_mean=trimmean(matrix_group,10,'round',2);
score=bsxfun(@minus,x,trimmed_mean);
sd=std(matrix_group,2);
temp = abs(score) > (alpha .* sd + gamma);
outmat = temp*1;
I have problems with the matrix_left and matrix_right creation. j1, that I am using for indexing is a column vector with the indices of price's observations. The output is simply
j1=[1:1:n]
price is a column vector of double with size(n,1)