What is the best (performance + memory) between String and HashSet for checking duplicates

Question

I want to do a simple implementation to do some operations based on a distinct Codes (aCode) among the bigCodeList contains duplicates. Below I have mentioned two approaches what i want to know is what is the more effective one among them in performance vice + memory consumption wise?

Approach 1 :

    String tempStr = "";

    for(String aCode : bigCodeList){
        if(tempStr.indexOf(aCode) == -1) {
            // deal With the aCode related work
            tempStr += aCode+"-"
        }
    }

Approach 2 :

        HashSet<String> tempHSet = new HashSet<String>();

        for(String aCode : bigCodeList){

            if(tempHSet.add(aCode)){

                // deal With the aCode related work

            }

        }

Note : aCode is a Three Letter code like LON

In general, the second approach should be preferred. In case you have multiple approaches for a problem and you are not sure which one to prefer, you should run some simple tests to get a rough idea of which one is better. — George, Jun 06 '13 at 08:39
@George: tests are good, but they can only tell you "better" in terms of speed or memory. And even that [can easily go wrong](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). Stylistic/stability problems are rarely shown by those. — Joachim Sauer, Jun 06 '13 at 08:40
@JoachimSauer I agree, one has to be careful about how to set them up and how to interpret the results, especially in languages such as java. If possible, it's better to theoretically analyze the approaches. I suggest a third approach: Tries (http://en.wikipedia.org/wiki/Trie). They should have the best theoretical performance in terms of speed and relatively good performance in terms of memory consumption. — George, Jun 06 '13 at 08:48
I created a test to check the _speed_ of these: http://pastebin.com/43yqjvjr The `HashSet` one is much slower, at least on my computer. — PurkkaKoodari, Jun 06 '13 at 09:00
@Pietu1998: you're using `!= -1` where you should be using `== -1`, which results in `tempStr` always staying empty! (So the String solution effectively does nothing). Also: please read the link I provided above about microbenchmarks, because they are easy to get wrong. — Joachim Sauer, Jun 06 '13 at 09:10
@JoachimSauer Ok, fixed that but it was directly copied from the post (didn't look at it a lot). It was like that 6 hours ago http://stackoverflow.com/posts/16957235/revisions — PurkkaKoodari, Jun 06 '13 at 15:05
@Pietu1998: and once you fixed it, what does the performance look like? — Joachim Sauer, Jun 06 '13 at 15:13
@JoachimSauer Yep, on 100,000 codes it was 1515 ms and 7 ms :) — PurkkaKoodari, Jun 06 '13 at 15:18
@Pietu1998 & JoachimSauer : Yes, just after I posted it, I noticed that != is incorrect so I corrected it in that instant. But it is long time earlier than Pietu1998's comment. However there's my bad too. Appology — ironwood, Jun 07 '13 at 04:44

score 7 · Accepted Answer · answered Jun 06 '13 at 08:35

Approach 2 is by far better. You should not even consider approach 1.

First of all, approach 1 has linear time in searching. That means that when tempStr becomes twice as long, the time to search it becomes twice as long (on average, of course, if you always find the first element, it stays short).

Next: you're copying the entire tempStr each time your appending to it (because String objects are immutable and that's the only way to create a new one from an existing one). So the adding option takes ages as well.

Third (not a performance concern): Mixing data (aCode) and meta-data (the separator -) like this leads to all kinds of undesired effects. You might be sure that now aCode can never contain a dash, but what if that changes in two weeks?

Fourth: HashSet is built for pretty much exactly this use case! That's what it does best: hold a set of distinct objects, check if it already exists and add a new one.

Thank you very much Joachim. Just another silly question. What about using StringBuilder for tempStr (because it is mutable). Will it fix the Second issue? Even-though will it not good enough than HashSet?(Considering memory + performance). Your current answer is also enough for my work. But this is just for the curiosity. — ironwood, Jun 06 '13 at 08:46
Yes, using a `StringBuilder` would avoid the second issue. I would *still* not suggest this approach, but it would be less bad ;-) — Joachim Sauer, Jun 06 '13 at 09:09

score 1 · Answer 2 · answered Jun 06 '13 at 08:36

I think, that the first one approach is worse: indexOf operation has O(n), while for HashSet complexity could be O(1) for unique String keys look-up.

Furthermore, in the first approach you are using string concatenation operation, which implies new String object creation each time, which gives additional performance draw.

score 0 · Answer 3 · answered Jun 06 '13 at 08:34

0

A java.util.Set will not allow duplicates, but it's fairly "quiet" about rejecting duplicates.

answered Jun 06 '13 at 08:34

Holger

496
3
8

2

Well, it *does* return `false` from `add()` when trying to add a duplicate (which the OP seems to know). – Joachim Sauer Jun 06 '13 at 08:36

score 0 · Answer 4 · answered Jun 06 '13 at 08:47

0

Performance and memory wise Hashset is best than String to use in ur coding.

Appending values into string variable will take time

answered Jun 06 '13 at 08:47

Haseena

53
1
7

What is the best (performance + memory) between String and HashSet for checking duplicates

4 Answers4