6

I read about how TreeSet is slower than HashSet, (that adding elements into a TreeSet is slower) so i made a performance test, i'm trying to find out if it's better to add elements to HashSet and then move them into a TreeSet or put them in there in the first place. It looks like that inserting elements into a HashSet is faster but only when i insert a large amount of elements, why? I've read that if i don't need the elements sorted, always use HashSet, but apparently, sometimes it's slower.

When I insert a concrete value("1") instead of random numbers, TreeSet is also faster because there is no sorting, so how can i know when to use HashSet or TreeSet?

And my second question, when i create TreeSet like this, why don't i have access to "NavigableSet" methods?

Set<Integer> treeSet = new TreeSet<Integer>();     //cant type treeSet.lower(e: E)
TreeSet<Integer> treeSet = new TreeSet<Integer>(); //can  type treeSet.lower(e: E)

Thanks for helping me with this.

Here are the results:

5 000 000 (random numbers)

enter image description here

5 000 000 (numbers "1")

enter image description here

500 000 (random numbers)

enter image description here

50 000 (random numbers)

enter image description here

Here is my code:

package performancetest;

import java.text.DecimalFormat;
import java.util.HashSet;
import java.util.InputMismatchException;
import java.util.Scanner;
import java.util.TreeSet;

public class HashSet_vs_TreeSet {
private static DecimalFormat df = new DecimalFormat("#.#####");
private static double hashTime, treeTime, hashToTreeTime;
private static int numbers = 0;

public static void main(String[] args){
    start();
    hashSetTest();
    treeSetTest();
    fromHashToTreeSet();
    difference();
}

/**
 * in this method, instead of "System.exit(1)" i try'ed "start()" but when i did lets say three mistakes, after correct input the 
 * methods hashSetTest();treeSetTest();fromHashToTreeSet();difference(); were running 4 times... i made this method exactly for the
 * purpose to repeat itself in case of a input mistake.
 */
public static void start(){
    System.out.print("Enter a number(a advise bigger or equal to 50 000): ");
    Scanner scan = new Scanner(System.in);
    try{
        numbers = scan.nextInt();
    }catch(InputMismatchException e){
        System.out.println("ERROR: You need to enter a number");
        System.exit(1);
    }
    System.out.println(numbers + " numbers in range from 0-99 was randomly generated into " +
            "\n1)HashSet\n2)TreeSet\n3)HashSet and then moved to TreeSet\n");
}

public static void hashSetTest(){
    /**
     * i try'ed HashSet<Integer> hashSet = new HashSet<Integer>();
     * but changing the load factor didn't make any difference, what its good for then ?
     */
    HashSet<Integer> hashSet = new HashSet<Integer>(5000,(float) 0.5);
    double start = System.currentTimeMillis() * 0.001;

    for (int i = 0; i < numbers; i++) {
        hashSet.add((int)(Math.random() * 100));
    }

    hashTime = (System.currentTimeMillis() * 0.001) - start;

    System.out.println("HashSet takes " + df.format(hashTime) + "s");
}

public static void treeSetTest(){
    TreeSet<Integer> treeSet = new TreeSet<Integer>();
    double start = System.currentTimeMillis() * 0.001;

    for (int i = 0; i < numbers; i++) {
        treeSet.add((int)(Math.random() * 100));
    }

    treeTime = (System.currentTimeMillis() * 0.001) - start;

    System.out.println("TreeSet takes " + df.format(treeTime) + "s");
}

public static void fromHashToTreeSet(){
    HashSet<Integer> hashSet = new HashSet<Integer>();
    double start = System.currentTimeMillis() * 0.001;

    for (int i = 0; i < numbers; i++) {
        hashSet.add((int)(Math.random() * 100));
    }

    TreeSet<Integer> treeSet = new TreeSet<Integer>(hashSet);
    hashToTreeTime = (System.currentTimeMillis() * 0.001) - start;

    System.out.println("TreeSet from HashSet takes " + df.format(hashToTreeTime) + "s");
}

public static void difference(){
    double differenceSec = 0;
    double differenceTimes = 0;

    if(hashTime < treeTime){
        differenceSec = (treeTime - hashTime);
        differenceTimes = (treeTime / hashTime);
        System.out.println("\nHashSet takes " + df.format(differenceSec) + "s less then TreeSet, it is " + df.format(differenceTimes) + " times faster");
    }else{
        differenceSec = (hashTime - treeTime);
        differenceTimes = (hashTime / treeTime);
        System.out.println("\nTreeSet takes " + df.format(differenceSec) + "s less then HashSet, it is " + df.format(differenceTimes) + " times faster");
    }
}
}
Peter
  • 409
  • 2
  • 6
  • 14
  • You should generate the same random numbers for both datatypes at each run. Having it completely random is of course good, but when comparing stuff like this you probably want to compare it on the same insertion data. – keyser Apr 19 '14 at 09:52
  • Hashing, in most cases, comes out considerably faster – Rogue Apr 19 '14 at 11:05
  • 2
    Note that the **fromHashToTreeSet** test is completely broken. Because you add only have 100 different values in the set (`Math.random() * 100`), it's only copying 100 values to the `TreeSet` from the `HashSet`, rather than 5 million. That's the only reason why this method appears faster than `treeSetTest` - in reality copying from a HashSet to a TreeSet *is not* faster – Erwin Bolwidt Apr 19 '14 at 11:09
  • @ErwinBolwidt: Note also that the benchmark is broken, too (no warmup, nothing) and so is the the way the random numbers are generated. – maaartinus Apr 19 '14 at 17:32

3 Answers3

3

Well, when you talk about peformance of TreeSet and HashSet you should clearly understand how these structures are organized what consequences of its organization.

Typically TreeSet is a structure where all elements are organized in a binary tree. Thus adding a member or accessing it is ~O(log(N)).

In other hand HashSet is a structure similar to an array. The difference is that in an array index is an unique number, while in a HashSet each key needs to be translated into index with the help of a hash function. A hash function may produce the same results for different input data, the situation is called hash collision. A good hash function (yes, they could be bad and good) produces as many unique results on a given set of input data as possible.

So accessing data in a hash set costs calculations of a hash function (in Java usually this is .hashCode()) and possible conflict resolution. That is its estimated as O(1) i.e. a constant number of operations.

You should understand that O(1) is not always less than O(log(n)), it's less asymptotically and on big enough n. Also a proper choice of a hash function does matter.

user3159253
  • 16,836
  • 3
  • 30
  • 56
2

0) JVM benchmarking is really complicated. Almost always you're measuring not what you're thinking you are measuring. There's http://openjdk.java.net/projects/code-tools/jmh/ for microbenchmarking from guys from Oracle. And you may try some benchmarking frameworks and guides.

JIT compiler warmup, initial memory allocation, garbage collector and a LOT of other things may invalidate your benchmark.

1) See also Hashset vs Treeset regarding your first question

2) Set<Integer> treeSet = new TreeSet<Integer>(); //cant type treeSet.lower(e: E)

That's how it works. You declare treeSet as Set. Set does not extends NavigableSet. You may explicitly cast if you want to. But if you want to access NavigableSet methods why wouldn't you declare treeSet as NavigableSet

Set<Integer> treeSet = new TreeSet<Integer>(); 
Integer lower = ((NavigableSet) treeSet).lower(); // thus should work
Community
  • 1
  • 1
Mikhail Boyarsky
  • 2,908
  • 3
  • 22
  • 37
0

Try to run this code. I took it from codeforces.ru. This is the demonstration of how HashSet/HashMap may work. It took 1.3 - 1.4 sec to add 10^5 values. According to linked topic - shuffling won't help (I didn't tried). TreeSet will never show such terrible perfomance.

import java.io.*;
import java.util.*;
import static java.lang.Math.*;

public class Main implements Runnable {

    public static void main(String[] args) {
        new Thread(null, new Main(), "", 16 * (1L << 20)).start();
    }

    public void run() {
        try {
            long t1 = System.currentTimeMillis();
            solve();
            long t2 = System.currentTimeMillis();
            System.out.println("Time = " + (t2 - t1));
        } catch (Throwable t) {
            t.printStackTrace(System.err);
            System.exit(-1);
        }
    }

    // solution

    int hashinv(int h) {
        h ^= (h >>> 4) ^ (h >>> 7) ^ (h >>> 8) ^ (h >>> 14) ^ (h >>> 15)
                ^ (h >>> 18) ^ (h >>> 19) ^ (h >>> 20) ^ (h >>> 21)
                ^ (h >>> 23) ^ (h >>> 26) ^ (h >>> 28);
        return h;
    }

    int bitreverse(int h) {
        int res = 0;
        for (int i = 0; i < 31; i++)
            if ((h & (1 << i)) != 0)
                res |= (1 << (30 - i));
        return res;
    }

    void solve() throws IOException {
        final int size = 100000;
        int[] s = new int[size];
        for (int i = 0, val = 0; i < size; i++) {
            s[i] = Integer.MAX_VALUE;
            while (s[i] > 1000000000)
                s[i] = hashinv(bitreverse(val++));
        }
        long startTime = System.currentTimeMillis();
        HashSet<Integer> h = new HashSet<Integer>(size);
        for (int i = 0; i < size; i++)
                h.add(s[i]);
        System.out.println("HashSet adding time = " + (System.currentTimeMillis() - startTime));
    }

}
Ralor
  • 371
  • 1
  • 3
  • 10
  • The title of linked article recommends against using `HashSet` for integer, and this is IMHO a pure non-sense. You can produce hash collisions with any kind of object. Moreover, the code looks rather complicated (and pretty rather crappy) which shows that chances to run into problems by accident are really low. Finally, Java 8 solves hash collisions nicely. – maaartinus Apr 19 '14 at 17:27
  • @maaartinus Author said that writing took ~15 mins. And such codes usually being used for hacks. Anyway, plz give me the link about such fix in java 8. There was just a comment in this CF topic that it wasn't fixed in java 8, with this link http://hg.openjdk.java.net/jdk8/jdk8/jdk/rev/43bd5ee0205e – Ralor Apr 19 '14 at 17:58
  • Hash collisions can indeed be [used for DOS attack](http://www.ocert.org/advisories/ocert-2011-003.html), but this doesn't mean that it can happen by accident. For Java 8 just UTFG or use the [source](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/HashMap.java#l145). AFAIK, the fix you linked was used in both Java 6 and 7, but (unlike the tree bins) it didn't really solve the problem. – maaartinus Apr 19 '14 at 18:41
  • [This paper](http://webcache.googleusercontent.com/search?q=cache:qx1I0kMKrMMJ:https://131002.net/siphash/siphashdos_appsec12_slides.pdf) shows why it didn't help. – maaartinus Apr 19 '14 at 18:49
  • @maaartinus thx. I changed "can work" to "may work". Also, I posted this answer to inform topic starter about such things and give some concrete example. – Ralor Apr 19 '14 at 18:50
  • I didn't object what you wrote (it's fine), but it can't be an explanation for this question (the explanation is that the OP's benchmark is broken and not worth further comments besides linking to JMH). I also objected the site you linked. And finally, I wanted to say that the attack doesn't work anymore. – maaartinus Apr 19 '14 at 18:55
  • @maaartinus I've read "Art of intrusion" by K.Michnik, and now I'm really not sure about "doesn't work anymore"))) 100% solutions (hacks/fixes/pachces/...) doesn't exist, that's the only fact I know) – Ralor Apr 19 '14 at 19:09
  • 1
    There ain't no silver bullet, but 1. The original attack stopped working after switching to Murmur, 2. no other attack works with Java 8 unless you can force the server to put non-`Comparable` items in the `HashMap`, 3. there are other hash-using maps, which stays vulnerable, e.g., Guava's `ImmutableMap`. – maaartinus Apr 19 '14 at 19:41