3

Given a list of KeyValuePairs, where each pair has a getValue() method, what would be the fastest way to obtain a List (or Set) of unique Values?

All of the below produce acceptable results. u1 seems to be the fastest over the expected list size (about 1000-2000 KVPs).

Can we do better (faster)?

private static Set<String> u1(List<_KVPair> pairs) {
    Set<String> undefined = new HashSet<String>();

    for (_KVPair pair : pairs) {
        undefined.add(pair.getValue());
    }

    if (undefined.size() == 1) {
        return new HashSet<String>();
    }
    return undefined;
}

private static List<String> u2(List<_KVPair> pairs) {

    List<String> undefined = new ArrayList<String>();
    for (_KVPair pair : pairs) {
        if (!undefined.contains(pair.getValue())) {
            undefined.add(pair.getValue());
        }
    }

    return undefined;
}

private static List<String> u3(List<_KVPair> pairs) {

    List<String> undefined = new LinkedList<String>();

    Iterator<_KVPair> it = pairs.iterator();
    while (it.hasNext()) {
        String value = it.next().getValue();
        if (!undefined.contains(value)) {
            undefined.add(value);
        }
    }
    return undefined;
}

At about 3600 pairs, 'u3' wins. At about 1500 pairs, 'u1' wins.

Brian
James Raitsev
  • What happens when you try it? Which has the lowest time complexity? – Peter Lawrey Sep 25 '12 at 14:39
  • It seems that the first one is the fastest. – James Raitsev Sep 25 '12 at 14:39
  • .. and it has the lowest time complexity, O(N) vs O(N^2) – Peter Lawrey Sep 25 '12 at 14:40
  • I mainly asked because, depending on size of the list, i am seeing different results. At about 3600 pairs, 'u3' (consistently) wins. At about 1500 pairs, 'u1' (consistently) wins – James Raitsev Sep 25 '12 at 14:44
  • 1
    I would make sure you are running the tests for at least 2-5 seconds each, otherwise you get results which are not reproducible. – Peter Lawrey Sep 25 '12 at 14:49
  • u1 makes no sense to me at all, it's not finding unique values in the original, just making a set. – NimChimpsky Sep 25 '12 at 14:55
  • 1
    @NimChimpsky A set is a collection which contains only unique values. Typically, the fastest way to find unique values from a big collection of values is to add all values to a set, causing all duplicates to disappear since the add method for the set will simply ignore the input if the input already exists in the set. – Alderath Sep 25 '12 at 15:08
  • @NimChimpsky So do u2 and u3 - they add all values from the original list, but each value only once even if it appears more than once. – assylias Sep 25 '12 at 15:08
  • @Alderath Elements that were duplicated in original will still appear in final set, only once. But yes I agree obviously that a set removes duplication. If the string "tempstr" appears twice in original, it will appear once in final result. My point is it should be completed removed from final result. – NimChimpsky Sep 25 '12 at 15:10
  • @NimChimpsky, the OP is not asking for the input list to be modified. – matt b Sep 25 '12 at 15:25

6 Answers

7

First option should be faster. You could possibly make it even faster by sizing the set before using it. Typically, if you expect a small number of duplicates:

Set<String> undefined = new HashSet<String>(pairs.size(), 1);

Note that I used 1 for the load factor to prevent any resizing.
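For context, a HashSet rehashes once its size exceeds capacity × load factor, so a capacity of pairs.size() with load factor 1 can absorb all the values without a single rehash. A standalone sketch of the effect (the 2,000-element size is an assumption for illustration, matching the question's 1000-2000 range):

import java.util.HashSet;
import java.util.Set;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedSize = 2_000; // assumed list size for illustration
        // threshold = capacity * loadFactor = expectedSize * 1, so adding up to
        // expectedSize distinct elements never triggers a rehash
        Set<String> presized = new HashSet<String>(expectedSize, 1);
        for (int i = 0; i < expectedSize; i++) {
            presized.add("value-" + i);
        }
        System.out.println(presized.size()); // prints 2000
    }
}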

Out of curiosity I ran a test (code below) - the results, after JIT warm-up, are:

Test 1 (note: takes a few minutes with warm-up)

size of original list = 3,000 with no duplicates (total ms over 100 runs):
set: 8
arraylist: 668
linkedlist: 1166

Test 2

size of original list = 30,000 - all strings identical (total ms over 100 runs):
set: 25
arraylist: 11
linkedlist: 13

That kind of makes sense:

  • when there are many duplicates, List#contains runs fairly fast because a duplicate is found quickly, whereas the cost of allocating a large set plus the hashing work becomes a penalty
  • when there are no or very few duplicates, the set wins by a large margin.
The benchmark code:

import java.util.*;

public class TestPerf {

    private static int NUM_RUN;
    private static Random r = new Random(System.currentTimeMillis());
    private static boolean random = false; // false: all strings identical (all duplicates); true: random strings (mostly unique)


    public static void main(String[] args) {

        List<String> list = new ArrayList<>();

        for (int i = 0; i < 30_000; i++) {
            list.add(getRandomString());
        }

        //warm up
        for (int i = 0; i < 10_000; i++) {
            method1(list);
            method2(list);
            method3(list);
        }

        NUM_RUN = 100;
        long sum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < NUM_RUN; i++) {
            sum += method1(list);
        }
        long end = System.nanoTime();
        System.out.println("set: " + (end - start) / 1000000);

        sum = 0;
        start = System.nanoTime();
        for (int i = 0; i < NUM_RUN; i++) {
            sum += method2(list);
        }
        end = System.nanoTime();
        System.out.println("arraylist: " + (end - start) / 1000000);

        sum = 0;
        start = System.nanoTime();
        for (int i = 0; i < NUM_RUN; i++) {
            sum += method3(list);
        }
        end = System.nanoTime();
        System.out.println("linkelist: " + (end - start) / 1000000);

        System.out.println(sum);
    }

    private static int method1(final List<String> list) {
        Set<String> set = new HashSet<>(list.size(), 1);
        for (String s : list) {
            set.add(s);
        }
        return set.size();
    }

    private static int method2(final List<String> list) {
        List<String> undefined = new ArrayList<>();
        for (String s : list) {
            if (!undefined.contains(s)) {
                undefined.add(s);
            }
        }
        return undefined.size();
    }

    private static int method3(final List<String> list) {
        List<String> undefined = new LinkedList<>();

        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            String value = it.next();
            if (!undefined.contains(value)) {
                undefined.add(value);
            }
        }
        return undefined.size();
    }

    private static String getRandomString() {
        if (!random) {
            return "skdjhflkjrglajhsdkhkjqwhkdjahkshd";
        }
        int size = r.nextInt(100);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < size; i++) {
            char c = (char) ('a' + r.nextInt(26)); // 26 letters: 'a' to 'z'
            sb.append(c);
        }
        return sb.toString();
    }
}
assylias
  • Actually this slows things down by about 30% in my test (about 3000 kvp) – James Raitsev Sep 25 '12 at 14:52
  • 1
    @Jam How do you run your test? How many times have you run it? How many duplicates? Also for 3000 values, the method will probably run in a few milliseconds so the precision of your clock could bias some results. – assylias Sep 25 '12 at 14:56
  • u1 just creates a set, it does not remove elements that were duplicated in the original list. It just removes the duplication. – NimChimpsky Sep 25 '12 at 14:56
  • 2
    @Jam it sounds like you might be benchmarking your code poorly. You need to run the methods to test dozens or hundreds of times to avoid measuring any "warm-up" costs. See http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – matt b Sep 25 '12 at 15:27
  • @Jam also make sure KVPair implements equals() and hashCode() – matt b Sep 25 '12 at 15:30
  • @Jam I have edited with a benchmark - the number of duplicates in the list has a significant impact on which method performs best. – assylias Sep 25 '12 at 15:35
  • Here are the results of a similar benchmark I ran, with 10000 test rounds on a List of size 5000, with the KV pairs having about 1000 unique values: https://gist.github.com/3782694 Using List.contains() with these parameters is about 56 times slower. – matt b Sep 25 '12 at 15:47
  • @mattb similar results then (considering your use case has 1000 unique values). – assylias Sep 25 '12 at 15:58
2

Update: see edit below

There is no point in iterating over the list when you could just do

return new HashSet<_KVPair>(pairs);

The worst options are u2 and u3, where you add items from the first list to a second list and call List.contains(item) on each iteration of the loop. This operation approaches O(n^2): List.contains(item) needs to compare the item against potentially the entire list. Avoid algorithms where you iterate over a list and call an operation that also iterates over the same list.

If you want unique items, use a Set. If you need these items in sorted order, use a TreeSet; otherwise, 99% of the time you want a HashSet.

edit: I missed that you want to get a set of pair.getValue(); but the advice is the same regardless - use a Set, and do not call List.contains() in a loop.
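A minimal sketch of that advice, reusing the _KVPair type from the question (swap HashSet for TreeSet if you need the values in sorted order):

private static Set<String> uniqueValues(List<_KVPair> pairs) {
    // one pass over the pairs; the set silently ignores duplicate values
    Set<String> values = new HashSet<String>();
    for (_KVPair pair : pairs) {
        values.add(pair.getValue());
    }
    return values;
}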

matt b
  • 2
    that doesn't find unique values, just makes a hashset from the list – NimChimpsky Sep 25 '12 at 14:40
  • You realize that the pairs list does not contain String objects, right? – Alderath Sep 25 '12 at 14:40
  • I am only interested in different values of the pair, not pairs themselves. Keys may as well be same, or different, does not matter. – James Raitsev Sep 25 '12 at 14:41
  • @Jam added an edit but the advice remains the same - use a Set. Adding a bunch of items to a Set in a loop versus constructing the set with `new HashSet(collection)` has the same complexity. – matt b Sep 25 '12 at 14:43
  • Typically, the hash of a key value pair would be the hash of the key, so the proposed solution would generate a set with unique keys, but not necessarily unique values. – Alderath Sep 25 '12 at 14:43
2

You will be able to speed up u1 by changing the first line to:

Set<String> undefined = new HashSet<String>(pairs.size());

Otherwise the set will internally have to resize repeatedly as you add values.
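Note that the default load factor is 0.75, so a set presized to exactly pairs.size() still resizes once it is 75% full (see the comments below). A sketch that also accounts for the load factor, assuming every value could be unique:

// capacity chosen so that pairs.size() entries stay below the 0.75 threshold
Set<String> undefined = new HashSet<String>((int) (pairs.size() / 0.75f) + 1);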

beny23
  • 2
    This will resize once the set is 75% full. – assylias Sep 25 '12 at 14:46
  • 1
    Good point, though how relevant this is may well be down to how many unique values would be in the list... – beny23 Sep 25 '12 at 14:48
  • 1
    Well... It's always a tradeoff... If you have a million key value pairs with only a handful of unique values, this solution would be wasting a lot of space. Since the occasional needs for buffer resizes only add an amortized constant runtime to the add method, buffer resizes should not be too big of a problem. However, if the operation is extremely performance critical, I agree that guesstimating a required capacity is probably a good idea (as long as the guesstimate is based on a reasonable assumption about expected inputs). – Alderath Sep 25 '12 at 14:58
  • maybe doubling the size would avoid rehashing altogether, but it will definitely waste memory. – Sednus Sep 25 '12 at 15:01
  • you can avoid resizing with `new HashSet(size, 1)` - which you should only use when you can guarantee the size of the set – matt b Sep 25 '12 at 15:24
  • @mattb Wouldn't it be a better idea then to do `new HashSet((int) (size / 0.75), 0.75f);`? Loading a hash set to a load factor of 1 does not seem like a good idea... – Alderath Sep 26 '12 at 11:57
  • 1
    @Alderath why does it seem like a bad idea? If you know that the set only needs to contain N items, I don't think you need to really spend space on making sure it can store the N+1th item effectively – matt b Sep 26 '12 at 13:33
  • @mattb I do not have any great reasons. Just gut feeling. It feels as if, if a load factor of 1 were a good idea, it would be the default (instead of 0.75). With a load factor of 1, the last couple of add operations are almost guaranteed to cause collisions. But I think we are both overanalyzing. Either way is probably fine. In fact, I believe that simply not defining any initial capacity and letting the hash set grow dynamically is also fine in almost all cases. – Alderath Sep 26 '12 at 14:58
1

I daresay that option 1 is the fastest and the cleanest. It is difficult to beat a hash set at checking whether a value is already contained in it.

The list-based solutions do not scale, as noted in the previous answer.

Konstantin Pribluda
1

Another method: extract the values, sort them, and then eliminate duplicates in one loop by keeping a reference to the last element added; if the current element equals it, don't add it to the new list, otherwise add it. Overall this is O(n log n). A runnable version of the idea:

List<String> values = new ArrayList<String>(pairs.size());
for (_KVPair pair : pairs) {
    values.add(pair.getValue());
}
Collections.sort(values); // O(n log n)

List<String> unique = new ArrayList<String>();
String lastAdded = null;
for (String value : values) { // O(n): duplicates are adjacent after sorting
    if (!value.equals(lastAdded)) {
        unique.add(value);
        lastAdded = value;
    }
}
Amit Deshpande
  • 1
    Tried that. Depending on size, this slows things down substantially. Sorting a large list takes time. – James Raitsev Sep 25 '12 at 14:54
  • Then you will have to implement a custom list which also contains a hash map, and implement a method which avoids duplicates by just returning the elements from each bucket. – Amit Deshpande Sep 25 '12 at 14:59
-1

None of the given answers removes duplicated values from the final result; they just remove the duplication. So if a string is present twice in the input, it will still be present in the final result, just once. If that's not what's required, well, I have just wasted five minutes ...

public Map<String, Integer> countOccurences(List<String> source) {
    Map<String, Integer> result = new HashMap<>(source.size());
    for (String value : source) {
        if (result.containsKey(value)) {
            // already seen: increment the count
            result.put(value, result.get(value) + 1);
        } else {
            result.put(value, 1);
        }
    }
    return result;
}

public List<String> sublistSingles(Map<String, Integer> results) {
    List<String> duplicatesRemoved = new ArrayList<>(results.size());
    for (Map.Entry<String, Integer> result : results.entrySet()) {
        // keep only the values that occurred exactly once
        if (result.getValue().equals(1)) {
            duplicatesRemoved.add(result.getKey());
        }
    }
    return duplicatesRemoved;
}
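A quick usage sketch of the two methods above (the input literals are hypothetical, and HashMap iteration order is unspecified):

List<String> source = Arrays.asList("a", "b", "a", "c");
Map<String, Integer> counts = countOccurences(source);
List<String> onlyOnce = sublistSingles(counts); // contains "b" and "c" only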
NimChimpsky
  • I don't think that's what the original code does. You return a list of all values that appear only once in the original list whereas the OP's code returns a list of unique values, including those that appear more than once (in other words it removes duplicates). – assylias Sep 25 '12 at 15:07
  • @assylias exactly my point, the code does not "find unique values in the list". It simply makes a set of unique elements. – NimChimpsky Sep 25 '12 at 15:12
  • 1
    I agree but it looks like what the op wants is a set of unique elements. This is admittedly not clear. – assylias Sep 25 '12 at 15:16
  • I guess this is a question of bad and ambiguous requirements/specifications. Either, you can interpret the question as "get a list of all the values from the original list that only occur once in the original list" or you can interpret it as "get a list of all the values that occur in the original list, and ensure that each value occurs only once in this output list". It is unclear whether the unique refers only to the output or to the input. – Alderath Sep 25 '12 at 15:16
  • @Alderath the phrase "find unique values in the list" is pretty unambiguous and clear to me. But yes, the code does something different. – NimChimpsky Sep 25 '12 at 15:20
  • 1
    @NimChimpsky But that was not the way the question was phrased. The question was "...what would be the fastest way to obtain a List (or Set) of unique Values?". Strictly speaking, I would even argue that `return Arrays.asList(1, 2);` could be an answer to the question, because that is a list of unique values. The original question does not even mention that the obtained unique values need to be a part of the original list of key value pairs (this constraint is merely implied). Hence, like I said, unclear specifications... – Alderath Sep 25 '12 at 15:26
  • @NimChimpsky I realize now that I was looking at the question (which IMO sounds more like it is asking the second alternative mentioned in my previous comment) and you were looking at the question title (which quite clearly supports your interpretation). I still think this is a question of unclear requirements though... The OP might even not be sure of what he wants himself... – Alderath Sep 25 '12 at 15:32