4

I have a program which simply removes duplicate elements of a character array using HashSet.

Here is my program:

import java.util.Arrays;
import java.util.HashSet;

import java.util.Set;

public class MainClass {
    public static void main(String[] arg) {
        double sT = System.nanoTime();
        Character[] data = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
                'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
                'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
                'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f',
                'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
                's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd',
                'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
                'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b',
                'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
                'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
                'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
                'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
                'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
                'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v',
                'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
                'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
                'u', 'v', 'w', 'x', 'y', 'z' };

        Set<Character > uniqueSet = new HashSet<Character>(Arrays.asList(data));

         Character[] strArr = new Character[uniqueSet.size()];
         uniqueSet.toArray(strArr);

            for(Character str:strArr){
                System.out.println(str);
            }


        System.out.println(System.nanoTime() - sT);

    }

}

It gives the desired output. But problem is execution time. Is there any ways that I can implement in my program to reduce its execution time?

Suneeta Singh
  • 272
  • 3
  • 16
  • 1
    How is it slow? It looks perfectly reasonable... (except that maybe you want to allocate your char array _before_ the first `.nanoTime()`) – fge Jan 17 '13 at 11:35
  • 1
    ..and don't include the printing in your timing... – Nim Jan 17 '13 at 11:36
  • @fge yeah it is. But the time it is taking is 1319709.0 nano sec. in average. However, I need it to take no more than 90000 nano sec. – Suneeta Singh Jan 17 '13 at 11:37
  • 2
    Timing _one_ execution does not help either. Note that there is the JIT kicking in etc which analyzes your code, etc. Don't time the first execution! – fge Jan 17 '13 at 11:38
  • Your code seems fine to me. You can go thru data[] array, fill uniqueSet and write results to screen in one combined for loop. But it will save just a minimum time. – Al Kepp Jan 17 '13 at 11:38
  • @Nim I excluded printing but it did not make much difference. :( – Suneeta Singh Jan 17 '13 at 11:39
  • 90000 nano secs regardles to the length of `data` array? – Archer Jan 17 '13 at 11:39
  • Your algorithm's complexity is actually O(1), and that's good. You parse approximately 200 data items in 1319 us, that's approximately 6.5 us per item. I would really like to know why do you think that is slow. – Al Kepp Jan 17 '13 at 11:41
  • Why the conversion back to an array before printing? That appears to be redundant(?) – Nim Jan 17 '13 at 11:47

4 Answers4

5

As the different types of elements that you can have is really small, you can easily use a simple array instead of a hashset(an approach similar to set or counting sort). If you only care for the non-capital English letters declare an array boolean met[26];, if you need to be able to support all characters use an boolean met[256];.

Than iterate over the array and only add a character to the result if its met value is false. When adding the character to the result don't forget to mark it as used.

No hashing involved and thus - better performance.

EDIT: as it seems there is some confusion with what I mean I will try to add a code sample

boolean met[] = new boolean[256]; // Replace 256 with the size of alphabet you are using
List<Character> res = new ArrayList<Character>();
for(Character c:data){
  int int_val = (int)c.charValue();
  if (!met[int_val]) {
     met[int_val] = true;
     res.add(c);
  }
}

// res holds the answer.
Ivaylo Strandjev
  • 69,226
  • 18
  • 123
  • 176
  • Character needs 16 bits, at least. And there are only 200 items in the input data array. So I bet your approach will be slower than hash set, because you need to go thru those 65536 items when you want to write the output to screen. But if OP asked for 26 English letters only, that would change the things dramatically. – Al Kepp Jan 17 '13 at 11:43
  • @AlKepp I am sorry but I don't think I understand your comment. I suggest using an array of size 256 instead of a hashset. Also my approach requires no hashing -simply mark the element with index corresponding to the character as true. What do you mean by 16 bit? – Ivaylo Strandjev Jan 17 '13 at 11:45
  • @AlKepp he uses the boolean array as marker only, and can index the array elements by the char value – Andreas Fester Jan 17 '13 at 11:45
  • @AlKepp after the edit of your comment - you never go over the array. You only add an element to the result if the corresponding cell in the array is still false. So you only iterate over the input which is inevitable. – Ivaylo Strandjev Jan 17 '13 at 11:46
  • @IvayloStrandjev I find your answer quite helpful. Could you please explain it a bit more clearly? – Suneeta Singh Jan 17 '13 at 11:47
  • He meant that UNICODE character will require two bytes to be stored – Archer Jan 17 '13 at 11:58
  • @archer still a reasonable size for an array I would say. – Ivaylo Strandjev Jan 17 '13 at 12:00
  • depends what will be the input array. if he use UNICODE chars you should extend the size of your array (as you properly said, using the size of his alphabet) – Archer Jan 17 '13 at 12:03
  • btw. one more step - print the output in the same loop, no need for a second loop... – Nim Jan 17 '13 at 12:07
  • @Nim I am assuming we are creating a method that computes the result, not just printing the result. – Ivaylo Strandjev Jan 17 '13 at 12:08
  • Can you include your timings for this solution? How much does it speed is up? – Peter Lawrey Jan 17 '13 at 12:26
  • @PeterLawrey I am sorry I did not do any timings. Still it seems certain that this solution will perform at least as efficiently as HashSet(in fact I believe it might be a few times faster). – Ivaylo Strandjev Jan 17 '13 at 12:30
  • @IvayloStrandjev The point I was eluding to is that while that portion of the code will be faster, much of the delay ~40% is in printing the results to the console. If you warmup the code and drop the output in the timing, it will be 50x faster. ;) – Peter Lawrey Jan 17 '13 at 12:47
3

The first thing to realise is that writing to the console is very expensive. If you remove

System.out.println(str);

This will speed up the code dramatically.

The other thing to note is that you code isn't run long enough to warm up the code (it won't be compiled) You should run the test for about 2 - 10 seconds to see how long it would take with warmed up code.

The time taken

  • with printing => 2,498,085 ns
  • without printing => 1,509,978 ns
  • if you warn up the code first => 43,347 ns
  • if you warm up the code and average 10,000 runs => 18,922 ns
  • optimise the code further and run one million times ~2secs => 1,287 ns

End to end that is a 2000x fold performance improvement ;)

The final code looks like

StringBuilder strArr = null;
long sT = 0;
int runs = 1000000;
for (int i = -40000; i < runs; i++) {
    if (i == 0)
        sT = System.nanoTime();

    String text = "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm" +
            "qwertyuiopasdfghjklzxcvbnm";
    BitSet bs = new BitSet(256);
    for (int j = 0, len = text.length(); j < len; j++)
        bs.set(text.charAt(j));
    strArr = new StringBuilder();
    for (int j = -1; (j = bs.nextSetBit(j+1)) >= 0; )
        strArr.append((char) j);
}

System.out.printf("Took an average of %,d ns%n", (System.nanoTime() - sT) / runs);

System.out.print(strArr);

prints

Took an average of 1,287 ns
abcdefghijklmnopqrstuvwxyz
Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • 1.3 nanoseconds can't be right, there's something being compiled out of existence here. – Marko Topolnik Jan 17 '13 at 12:12
  • 1
    @MarkoTopolnik 1.3 ns would be wrong (about 4 clock cycles). It is 1287 ns (about 4000 clock cycles). ;) – Peter Lawrey Jan 17 '13 at 12:24
  • Yes, my Caliper benchmark confirms this. BTW I modified your code to use `new String(...)` because I think it is more realistic to pay for string creation every time (OP is probably not doing this on a constant, otherwise runtime would be irrelevant). I also insist on creating the final `char[]`. – Marko Topolnik Jan 17 '13 at 12:26
2

Let Google Caliper take care of microbenchmarking:

 0% Scenario{vm=java, trial=0, benchmark=Array} 12868.77 ns; σ=523.07 ns @ 10 trials

  us
12.9

vm: java
trial: 0
benchmark: Array

12.9 microseconds. Not quite the same as your 1319 microseconds, is it :)

For reference, this is the exact code:

import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class Performance extends SimpleBenchmark {
  public int timeArray(int reps) {
    int sum = 0;
    for (int rep = 0; rep < reps; rep++) {
      Character[] data = { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
          'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
          'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b',
          'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u',
          'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
          'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g',
          'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
          'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's',
          't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l',
          'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e',
          'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x',
          'y', 'z' };
      sum += new HashSet<>(Arrays.asList(data))
          .toArray(new Character[new HashSet<Character>
                   (Arrays.asList(data)).size()]).length;
    }
    return sum;
  }

  public static void main(String... args) {
    Runner.main(Performance.class, args);
  }
}
Marko Topolnik
  • 195,646
  • 29
  • 319
  • 436
0

You can use Google Guava to do it for you. There is a Stackoverflow discussion about it.

Community
  • 1
  • 1
christian.vogel
  • 2,117
  • 18
  • 22