0

At a point in time, I need to find all Targets that have points earlier and later than that time.

Currently I am doing the following:

for current_time = sorted_set_of_times;
    target_set = find([Targets.First_Time] <= current_time ... 
        & [Targets.Last_Time] >= current_time);
    % Using target_set here. Targets are not modified within this loop
end

MATLAB profiler is telling me that this line takes about 40 minutes with 40,000 calls.

Is there a way that I can do this more efficiently?

current_time is incremented in a for_loop in order. I was thinking that the fact that this is always increasing could be used to shortcut some of the testing. For example Targets.First_Time <= current_time will be a subset of Targets.First_Time <= current_time+n.

numel(sorted_set_of_times) is approximately 40,000

numel(Targets) is more than 10 million

Samuel O'Malley
  • 3,471
  • 1
  • 23
  • 41
  • 1. struct could be causing it to be slower. 2. logical indexing might help remove find and thus get some performance improvement. Is `current_time` incremented with constant stepsize? Could you share more of the related code, the loop code maybe? – Divakar Aug 04 '14 at 05:35
  • Hi Divakar, sorry I can't share the actual code because it is on an air-gapped system. `current_time` does not increment evenly, but it is sorted. – Samuel O'Malley Aug 04 '14 at 05:43
  • Do `Targets.First_Time` or `Targets.Last_Time` change within that loop? – Divakar Aug 04 '14 at 05:44
  • No, Targets is unmodified – Samuel O'Malley Aug 04 '14 at 05:44
  • `numel(sorted_set_of_times)`? – Divakar Aug 04 '14 at 05:52
  • Approximately 40,000. Targets is 10M+. This is why I am thinking a sorted set of first/last times would be a lot more efficient than doing an O(n) search on 10 million targets 40,000 times – Samuel O'Malley Aug 04 '14 at 05:54

4 Answers4

1

Approach #1

Vectorized solution with bsxfun -

eqcond = bsxfun(@le,Targets.First_Time,sorted_set_of_times) & bsxfun(@ge,Targets.Last_Time,sorted_set_of_times);
[r,c] = find(eqcond);
for k =1:numel(sorted_set_of_times)
    target_set = r(c==k);
end

Suggestion to use logical indexing (if applicable): If you are using target_set to index into one of the dimensions of some variable named target, which is of length same as Targets.First_Time for that dimension, then instead of calculating target_set you can directly index into that dimension of target using eqcond(:,ind), where ind would be the index to sorted_set_of_times in that loop. Thus, as an example, if you were indexing into the rows of target using target_set in your original code like target(target_set,..), then you can do target(eqcond(:,ind),..) instead.

Approach #2

Reduced data approach -

vind = find(Targets.First_Time <= Targets.Last_Time); %//Indices to reduce Targets datasize
Targets.First_Time = Targets.First_Time(vind);
Targets.Last_Time = Targets.Last_Time(vind);
for current_time = sorted_set_of_times;
    target_set = vind([Targets.First_Time] <= current_time & [Targets.Last_Time] >= current_time);
end

Approach #3 (Hybrid of Approaches #1,2)

vind = find(Targets.First_Time <= Targets.Last_Time);
Targets.First_Time = Targets.First_Time(vind);
Targets.Last_Time = Targets.Last_Time(vind);
eqcond = bsxfun(@le,Targets.First_Time,sorted_set_of_times) & bsxfun(@ge,Targets.Last_Time,sorted_set_of_times);
[r,c] = find(eqcond);
for k =1:numel(sorted_set_of_times)
    target_set = vind(r(c==k));
end

Suggestion to use logical indexing (if applicable): On a similar discussion as stated for approach #1, you could perform logical indexing for this one too. Thus, using the same notations as for that discussion, instead of doing target(target_set,..), you can do target(vind(eqcond(:,ind)),..).

Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Do `Targets.First_Time` and `sorted_set_of_times` have to be the same length here? – Samuel O'Malley Aug 04 '14 at 06:05
  • Woops, sorry. There are 10 million+ Targets, each with a First_Time and Last_Time. – Samuel O'Malley Aug 04 '14 at 06:10
  • @SamuelO'Malley Ops my bad! I am assuming `Targets.First_Time` to be a column vector and `sorted_set_of_times` to be a row vector and also they don't have to be of the same lengths! Please ignore my previous comment! – Divakar Aug 04 '14 at 06:10
  • No problem. Let me get back to you with the timing, it could still take a couple of hours to run through. Thanks for your help :) – Samuel O'Malley Aug 04 '14 at 06:17
  • @SamuelO'Malley Sure I would be rooting for this one! Keep in mind the row and column vector thing. – Divakar Aug 04 '14 at 06:18
  • 1
    @SamuelO'Malley Check out the just added `Approach #2`! – Divakar Aug 04 '14 at 07:12
  • Does the second approach reduce to the askers approach if `First_Time` is not greater than `Last_Time` for each target? – Dennis Jaheruddin Aug 04 '14 at 13:22
  • @DennisJaheruddin Well that approach basically narrows the window for both First_Time and Last_Time, because of the way those are compared against current_time and thus reduces datasize for both First_Time and Last_Time. – Divakar Aug 04 '14 at 13:33
1

I don't know specific MATLAB instructions, but you can try following, trading memory for CPU:

  1. Iterate over all targets as TARGET.
  2. Find min_time >= TARGET.First_Time and max_time <= TARGET.Last_Time - using binary search.
  3. Add TARGET to target_set for each time between min_time and max_time.
  4. Iterate times as before, but use precomputed target_sets.
kilotaras
  • 1,419
  • 9
  • 24
1

I was just looking at the problem description, and I think the use of an Interval Tree or KD Tree might be a better choice of data structure to the type of query you are doing.( Search based on ranges).

As in, let's say you build an interval tree with (startTime, EndTime), then you can execute the query on the ranges. Here's a very good thread on Stackoverflow about such range querying data structures

I don't know matlab, but it does seem to have support for KD-TREES. I couldn't find information about Interval trees in matlab, but here's a good tutorial on the same.

Community
  • 1
  • 1
Arvind
  • 466
  • 3
  • 9
1

I think that a brute force approach will never get much faster with these numbers. (Not by taking out structs or using logical indexing).

If you want a fast algorithm, I think the following can work:

  1. Sort current_time (already done)
  2. Sort target by First_Time
  3. Find for all targets the starting point S
  4. Sort by Last_Time
  5. Find for all targets the endpoint E

If you have two sorted lists you just need to loop through each once to find the relevant point, significantly reducing the complexity of the operation.

Now you can just call your function for each target and use current_time(S:E)

Dennis Jaheruddin
  • 21,208
  • 8
  • 66
  • 122