0

So I've got a program that contains a bunch of records in a Set. The set could have a few items or potentially hundreds of thousands. One bit of data each record has is a timestamp. I need to eliminate all items in a set but one that are within 15 seconds of each other. What is the most efficient way to do that?

Currently I create a duplicate of the set. Then I iterate through the set comparing the first item to every other item, and repeat. If a match is found that is within 15 seconds, I remove it from the duplicate set. Then the duplicate set is written out to a file.

Obviously this works but I finally realized that this is ridiculously inefficient. For large sets this seems to take a crazy long time, assuming it's not some other problem occurring. Can someone provide me with a smarter, faster, more efficient (or just a proper) way to do this in Java? I'm realizing since the records contain timestamps that sorting them would probably help tremendously. I'd like to keep this all contained within the program though so I guess I need to look into sorting and Comparators.

I just can't quite wrap my head around the problem. I've come up with some other thoughts to improve my code but I can't help but feeling I'm still coming at this entirely wrong. Thanks for any suggestions.

Oh, and this is for work, not school or anything so any help is appreciated.

lbalazscs
  • 17,474
  • 7
  • 42
  • 50
cardician
  • 2,451
  • 3
  • 26
  • 36
  • 1
    for sure - get the list sorted - then you can make a single pass. – Randy Jan 11 '13 at 19:29
  • Why don't you build a `Map` with timestamps as keys and a list of matching items as values? – fge Jan 11 '13 at 19:45
  • Also, what is this "first element"? Do you use a LinkedHashSet? – fge Jan 11 '13 at 19:54
  • @fge A map with the timestamp as a key doesn't help me at all and what difference does it make what the first element is? And no, I'm using a TreeSet now for sorting purposes. – cardician Jan 11 '13 at 20:18
  • @cardician see my answer: having a Map can help a lot. – fge Jan 11 '13 at 20:29

4 Answers4

5

Right now, the algorithm you've described runs in O(n2) time.

Now, If you needed a faster algorithm, what you can do is

  • Sort your collection (I'd be surprised if java didn't have a base-class sorting function) O(n * lg(n))
  • All "matches" within 15 seconds of each other are going to be right next to each-other
  • You need only iterate through each element once checking only adjacent elements O(n)

If you do that, than your algorithm could be a much more manageble O(n * lg(n)) time complexity

Here's some information regarding Java's Array.sort()

  • I appreciate the info on sort times, but I need more info. I can't figure out how, even searching on a sorted list, it's all that much faster. Sure I can break out of the loop once I've moved outside of the 15 second window, but then for the next item I have to move through the entire list until I reach it's place in the list and then search from there. I don't know how to do that without still making comparisons through the list which as far as I can see doesn't save me any time. I know I'm just not getting it. – cardician Jan 11 '13 at 20:16
  • @cardician Instead of comparing each item to every other item, You only have to compare it to items **Immediately before** and **immediately after** to know if it's 15 seconds away from another item. That saves you a lot of comparing – Sam I am says Reinstate Monica Jan 11 '13 at 20:26
  • Thanks. With your answer and a little more from lbalazscs I gained enough understanding to properly implement things. It's so much faster than my method now that it's just sad how poorly I coded it. – cardician Jan 14 '13 at 15:09
1

You can continue using a Set, just make sure that it is sorted from the very beginning, like TreeSet (or ConcurrentSkipListSet if you have multiple threads). Either you implement Comparable so that the timestamps are compared, or you supply a Comparator that does the same.

This will guarantee that you cannot have duplicates (like you had it until now), and also simplify your code. Inserting into a TreeSet will also cost you O(n log n) time.

From here on you can continue with the approach suggested by Sam I am: the iterator will traverse it in ascending element order, you need to compare each element only with the previous one and the next one.

BTW, you don't need to copy everything to another Set, just make sure to use the remove method of the iterator, not the remove of the TreeSet: Iterating through a Collection, avoiding ConcurrentModificationException when removing in loop

Community
  • 1
  • 1
lbalazscs
  • 17,474
  • 7
  • 42
  • 50
  • Thank you very much for that. You helped make things a little clearer for me and I was able to properly implement it this way now. It is unbelievably faster than my method. – cardician Jan 14 '13 at 15:07
0

If you have a map, say:

Map<Long, List<MyClass>> map;

where the key is the timestamp, then you can do this:

// Value of wanted elements
List<MyClass> ret = new ArrayList<MyClass>();

// Go over all timestamps: if a timestamp is wanted, add all
// corresponding elements
for (Map.Entry<Long, List<MyClass>> entry: map.entrySet())
    if (wanted(entry.getKey()))
        ret.addAll(entry.getValue());

// Return
return ret;
fge
  • 119,121
  • 33
  • 254
  • 329
0

I haven't tested for performance, but one way I might implement this is to create a Set and override the equals() method for the object types in question.

public boolean equals( Object o )
{
  return( Math.abs( this.getTimestampSeconds() - o.getTimestampSeconds() ) < 15 );
}

By doing this, when you add each row to the set, you will only end up with one entry for any given 15 second time slice.

* EDIT **

I would not perform this override for a regular domain object. I probably would only do this in a facade object of some sort -- which is created solely for this purpose.

Also, as others have said. This presumes that your input list is sorted by ascending timestamp.

mightyrick
  • 910
  • 4
  • 6