0

I tried to use HashSet to remove the duplications from an ArrayList<StringBuilder>.

E.g. Here is an ArrayList, each line is a StringBuilder object.

"u12e5 u13a1 u1423"
"u145d"
"u12e5 u13a1 u1423"
"u3ab4 u1489"

I want to get the following:

"u12e5 u13a1 u1423"
"u145d"
"u3ab4 u1489"

My current implementation is:

static void removeDuplication(ArrayList<StringBuilder> directCallList) {
    HashSet<StringBuilder> set = new HashSet<StringBuilder>();
    for(int i=0; i<directCallList.size()-1; i++) {
        if(set.contains(directCallList.get(i)) == false)
            set.add(directCallList.get(i));
    }   
    StringBuilder lastString = directCallList.get(directCallList.size()-1);
    directCallList.clear();
    directCallList.addAll(set);
    directCallList.add(lastString);
} 

But the performance becomes worse and worse as the ArrayList size grows. Is there any problem with this implementation? Or do you have any better ones in terms of performance?

JackWM
  • 10,085
  • 22
  • 65
  • 92
  • What happened when you used the hashset? – Pradeep Pati Oct 15 '12 at 16:58
  • Hashset by definition contains unique entries – pankar Oct 15 '12 at 16:59
  • 1
    Is the a real need to store StringBuilders? Could you store just the Strings they contain? – Shaded Oct 15 '12 at 17:00
  • @Shaded Due to this http://stackoverflow.com/questions/12899953/in-java-how-to-append-a-string-more-efficiently, I have to ... – JackWM Oct 15 '12 at 17:01
  • 1
    First, by calling `contains` and `add` you have doubled the work being done because `Set.add` does the `contains` check internally. That is the point of using a `Set` – John B Oct 15 '12 at 17:01
  • Why do you need an ArrayList of StringBuilders? That just seems odd. – ChadNC Oct 15 '12 at 17:04
  • The problem you're having is that StringBuilders aren't compared based on their contents, instead it's only based on memory location. So 2 StringBuilders with the same contents still won't show as equal. You'll need to do some manipulation to check the contents of the StringBuilder using .toString() to see if they are equal... I'll write up an answer if my build keeps taking so long. – Shaded Oct 15 '12 at 17:04
  • Is performance really an issue here? Have you checked for correctness? – Code-Apprentice Oct 15 '12 at 17:23

5 Answers5

9

StringBuilder doesn't implement equals() or hashcode(). Two StringBuilders are only equal if they are the exact same object, so adding them to a HashSet won't exclude two different StringBuilder objects with identical content.

You should convert the StringBuilders to String objects.

Also, you should initialize your HashSet with an "initial capacity" in the constructor. This will help with the speed if you are dealing with large numbers of objects.

Lastly, it's not necessary to call contains() on the hashset before adding an object. Just add your Strings to the set, and the set will reject duplicates (and will return false).

Sam Barnum
  • 10,559
  • 3
  • 54
  • 60
2

Let's analyze your method to find where we can improve it:

static void removeDuplication(ArrayList<StringBuilder> directCallList) {
    HashSet<StringBuilder> set = new HashSet<StringBuilder>();
    for(int i=0; i<directCallList.size()-1; i++) {
        if(set.contains(directCallList.get(i)) == false)
            set.add(directCallList.get(i));
    }

This for loop repeats once for each element in the ArrayList. This seems unavoidable for the task at hand. However, since HashSet can only contain one of each item, the if statement is redundant. HashSet.add() does the exact same check again.

    StringBuilder lastString = directCallList.get(directCallList.size()-1);

I don't understand the need to get the lastString from your list and then add it. If your loop works correctly, it should have already been added to the HashSet.

    directCallList.clear();

Depending on the implementation of the list, this can take up to O(n) time because it might need to visit every element in the list.

    directCallList.addAll(set);

Again, this takes O(n) time. If there are no duplicates, set contains the original items.

    directCallList.add(lastString);

This line seems to be a logic error. You will add a String which is already in the set and added to directCallList. }

So overall, this algorithm takes O(n) time, but there is a constant factor of 3. If you can reduce this factor, you can improve the performance. One way to do this is to simply create a new ArrayList, rather than clearing the existing one.

Additionally, this removeDuplication() function can be written in one line if you use the correct constructors and return the ArrayList without duplicates:

static List<StringBuilder> removeDuplication(List<StringBuilder> inList) {
    return new ArrayList<StringBuilder>(new HashSet<StringBuilder>(inList));
}

Of course, this still doesn't address the issues with StringBuilder that others have pointed out.

Code-Apprentice
  • 81,660
  • 23
  • 145
  • 268
  • Since StringBuilder does not override `equals`, his method or the changes you propose would not work. – assylias Oct 15 '12 at 17:13
  • 1
    @assylias I was answering the question about performance and assumed that the OP had already checked the algorithm for correctness. My bad! – Code-Apprentice Oct 15 '12 at 17:24
1

So you had some other options, but I like my solutions short, simple, and to the point. I've changed your method to no longer manipulate the parameter, but rather return a new List. I used a Set<String> to see if the contents of each StringBuilder was already included and returned the unique Strings. I also used a for each loop instead of accessing by index.

static List<StringBuilder> removeDuplication(List<StringBuilder> directCallList) {
    HashSet<String> set = new HashSet<String>();
    List<StringBuilder> returnList = new ArrayList<StringBuilder>();
    for(StringBuilder builder : directCallList) {
        if(set.add(builder.toString())
            returnList.add(builder);
    }   
    return returnList;
} 
Shaded
  • 17,276
  • 8
  • 37
  • 62
0

As Sam states, StringBuider does not override hashCode and equals and so the Set will not work appropriately.

I think the answer is to wrap the Builder in an object that executes toString only once:

class Wrapper{
   final String string;
   final StringBuilder builder;

   Wrapper(StringBuilder builder){
      this.builder = builder;
      this.string = builder.toString();
   }

   public int hashCode(){return string.hashCode();}

   public boolean equals(Object o){return string.equals(o);}
}     


 public Set removeDups(List<StringBuilder> list){
    Set<Wrapper> set = ...;
    for (StringBuilder builder : list)
       set.add(new Wrapper(builder));

    return set;
 }

The removeDups method could be updated to extract the builders from the set and return a List<StringBuilder>

John B
  • 32,493
  • 6
  • 77
  • 98
0

As explained, StringBuilders don't override Object#equals and aren't Comparable.

Although using StringBuilders to concatenate your Strings is the way to go, I would suggest that once you are done with your concatenation, you should store the underlying strings (stringBuilder.toString()) instead of the StringBuilders in your list.

Removing duplicates then becomes a one line:

Set<String> set = new HashSet<String>(list);

Or even better, store the strings in the set directly if you don't need to know that there are duplicates.

assylias
  • 321,522
  • 82
  • 660
  • 783