
Suppose I had the following code:

public Set<String> csvToSet(String src) {
    String[] splitted = src.split(",");
    Set<String> result = new HashSet<>(splitted.length);
    for (String s : splitted) {
        result.add(s);
    }
    return result;
}

so I need to transform an array into a Set. IntelliJ IDEA suggests replacing my for-each loop with a Collection.addAll one-liner, so I get:

...
Set<String> result = new HashSet<>(splitted.length);
result.addAll(Arrays.asList(splitted));
return result;

The complete inspection message is:

This inspection warns when calling some method in a loop (e.g. collection.add(x)) could be replaced with a call to a bulk method (e.g. collection.addAll(listOfX)). If the checkbox "Use Arrays.asList() to wrap arrays" is checked, the inspection will warn even if the original code iterates over an array while the bulk method requires a Collection. In this case the quick-fix action will automatically wrap the array with an Arrays.asList() call.

From the inspection description, it sounds like it works as expected.

If we refer to the top answer to a question about converting an array into a Set (How to convert an Array to a Set in Java), the same one-liner is suggested:

Set<T> mySet = new HashSet<T>(Arrays.asList(someArray));

Even though creating an ArrayList from an array is O(1), I do not like the idea of creating an additional List object.

Usually I trust IntelliJ inspections and assume they do not suggest anything less efficient. But today I am curious why both the top SO answer and IntelliJ IDEA (with default settings) recommend the same one-liner that creates a useless intermediate List object, while Collections.addAll(destCollection, yourArray) has existed since JDK 6.
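For illustration, here is a sketch of the Collections.addAll variant in the same shape as the csvToSet method above (my own wrapper class name, not from the question):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class CsvToSet {
    // Variant using Collections.addAll: the static helper accepts the
    // array directly as varargs, so no intermediate List is created.
    public static Set<String> csvToSet(String src) {
        String[] splitted = src.split(",");
        Set<String> result = new HashSet<>(splitted.length);
        Collections.addAll(result, splitted);
        return result;
    }

    public static void main(String[] args) {
        // duplicates collapse, leaving a two-element set
        System.out.println(csvToSet("a,b,a"));
    }
}
```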

The only reason I see for it is that both (the inspection and the answers) are too old. If so, here is a reason to improve IntelliJ IDEA and to give more votes to the answer proposing Collections.addAll() :)

Kirill
  • What are you asking? Which is best? The 3rd. – Andy Turner Jan 11 '18 at 12:31
  • @AndyTurner which is best, where best = highest performance? – Kirill Jan 11 '18 at 12:32
  • Using `asList(array)` is just wrapping some convenient behaviour around an array; and doing that allows you to use the `HashSet` constructor so you can create the set and fill it in one step. I can't see why there would be any performance downside. – khelwood Jan 11 '18 at 12:32
  • @Derp best as in it's the most concise, and the performance hit of creating that list - if there is any at all - is not worth worrying about. Trying not to create the list is (likely-unwarranted) microoptimization. The oft-repeated advice here is: write the most readable code; if you find the performance is lacking, profile it, and only when you find that *this* is the bottleneck should you worry about rewriting it. – Andy Turner Jan 11 '18 at 12:33
  • @khelwood with the plain for loop there are no calls to Iterator methods, so there are fewer delegations, and maybe some compiler loop optimization works better in the original code (i.e. software pipelining https://en.wikipedia.org/wiki/Software_pipelining ). I think this is not handled well by IntelliJ – Mladen Adamovic Jan 11 '18 at 12:48
  • @Derp Why don't you benchmark the time and memory of the two ways? Just create a huge array and mesure the results. – Robert D. Mogos Jan 11 '18 at 12:49
  • @RobertD.Mogos I still do not have hands-on experience with JMH. Anyway, I found it worth asking this question; someone else could be interested too and will probably find benchmark results here (if they appear under the question) – Kirill Jan 11 '18 at 12:54
  • @Derp see my answer below – Robert D. Mogos Jan 11 '18 at 17:03

2 Answers


A hint as to why IntelliJ doesn't suggest the Arrays.asList replacement for

Set<String> result = new HashSet<>(splitted.length);
result.addAll(Arrays.asList(splitted));
return result;

is in the source code for HashSet(Collection):

public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}

Note that the capacity of the set isn't the size of c.

As such, the change would not be semantically equivalent.


Don't worry about creating the List. It is really cheap. It's not free; but you would have to be using it in a really performance-critical loop to notice.

Andy Turner
  • this wrap is cheap, unless you use it in a very performance sensitive way – Mladen Adamovic Jan 11 '18 at 12:50
  • ouch :-) It seems I really need more sleep! :-) – Mladen Adamovic Jan 11 '18 at 13:09
  • This answer demonstrates that when the OP attempted a premature optimization, they instead made a *pessimization*. The value of the HashSet constructor argument was incorrect and ended up creating a HashSet that was too small and much more likely to end up with collisions. The API developers know better how to optimally initialize a HashSet to hold a given collection, so make use of their knowledge. Also, by reusing an often-used API method you are more likely to have it JIT-compiled. – Klitos Kyriacou Jan 11 '18 at 13:23
  • Interesting answer but imo a bit off the question. @KlitosKyriacou I was not offered to use constructor in any way. I was offered to use Set's instance's `addAll` which requires collection as argument, while there is a static method `Collections.addAll(yourDestCollection, yourSrcArr)`. – Kirill Jan 11 '18 at 16:24
  • @KlitosKyriacou Close. The incorrect initial size is indeed too small, but the result won't be more collisions. Instead, the table will have to be reallocated partway through addition of the elements. But yes, using the `HashSet(Collection)` constructor instead is probably the best approach. – Stuart Marks Jan 12 '18 at 07:55

I wrote a small benchmark to measure the performance of the three ways of adding an array to a HashSet, and here are the results.

First, the base code used by all of them, which generates an array of maxSize elements with values between 0 and 100:

    int maxSize = 10000000; // 10M values
    String[] s = new String[maxSize];
    Random r = new Random();

    for (int i = 0; i < maxSize; i++) {
        s[i] = "" + r.nextInt(100);
    }

Then the benchmark function:

public static void benchmark(String name, Runnable f) {
    long startTime = System.nanoTime();
    f.run();
    long endTime = System.nanoTime();
    System.out.println("Total execution time for: " + name + ": " + (endTime - startTime) / 1000000 + "ms");
}

The first way uses your code with a loop; for 10M values it took between 150ms and 190ms (I ran the benchmark several times for each method):

    Main.benchmark("Normal loop ", () -> {
        Set<String> result = new HashSet<>(s.length);
        for (String a : s) {
            result.add(a);
        }
    });

The second way uses result.addAll(Arrays.asList(s)); and it took between 180ms and 220ms:

    Main.benchmark("result.addAll(Arrays.asList(s)): ", () -> {
        Set<String> result = new HashSet<>(s.length);
        result.addAll(Arrays.asList(s));
    });

The third way uses Collections.addAll(result, s); and it took between 170ms and 200ms:

    Main.benchmark("Collections.addAll(result, s); ", () -> {
        Set<String> result = new HashSet<>(s.length);
        Collections.addAll(result, s);
    });

Now the explanation. In terms of runtime complexity, all three run in O(N): for N values, N add operations are performed.

From a memory point of view, all three are again O(N); only the new HashSet is created.

Arrays.asList(someArray) does not create a new array; it just creates a new object that holds a reference to that array. You can see it in the JDK source:

    private final E[] a;

    ArrayList(E[] array) {
        a = Objects.requireNonNull(array);
    }
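You can verify this wrapping behaviour yourself: the list returned by Arrays.asList is a live view of the array, not a copy.

```java
import java.util.Arrays;
import java.util.List;

public class AsListView {
    public static void main(String[] args) {
        String[] arr = {"a", "b"};
        List<String> view = Arrays.asList(arr);
        arr[0] = "z"; // a write through the array...
        System.out.println(view.get(0)); // ...is visible through the list: prints "z"
    }
}
```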

Besides that, all the addAll methods do exactly what you did, a for-loop:

// addAll method for Collections.addAll(result, s);
public static <T> boolean addAll(Collection<? super T> c, T... elements) {
    boolean result = false;
    for (T element : elements)
        result |= c.add(element);
    return result;
}

// addAll method for result.addAll(Arrays.asList(s));
public boolean addAll(Collection<? extends E> c) {
    boolean modified = false;
    for (E e : c)
        if (add(e))
            modified = true;
    return modified;
}

To conclude: since the runtime difference is so small, IntelliJ simply suggests a clearer and shorter way to write the code.
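Following the presizing discussion in the comments below, a variant that sizes the set up front for s.length entries (mirroring the HashSet(Collection) constructor formula, so the table is not reallocated mid-stream) might look like this sketch:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class PresizedAddAll {
    public static void main(String[] args) {
        String[] s = {"1", "2", "3", "4"};
        // Capacity for s.length entries at the default 0.75 load factor,
        // the same formula HashSet(Collection) uses internally
        int capacity = Math.max((int) (s.length / .75f) + 1, 16);
        Set<String> result = new HashSet<>(capacity);
        Collections.addAll(result, s);
        System.out.println(result.size()); // 4
    }
}
```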

Robert D. Mogos
  • `System.currentTimeMillis()` is not monotonic. Use `System.nanoTime()` instead (or JMH). – Klitos Kyriacou Jan 11 '18 at 18:19
  • Also, you should time `Set result = new HashSet<>(Arrays.asList(s));` which is the canonical way of creating a set from an array. It should be faster than creating a HashSet with lower-than-optimal constructor argument first and then calling `addAll()`. – Klitos Kyriacou Jan 11 '18 at 18:23
  • @KlitosKyriacou i updated the response. `Set result = new HashSet<>(Arrays.asList(s));` is basically doing `map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16)); addAll(c);` and the benchmark says `180ms to 210ms` – Robert D. Mogos Jan 11 '18 at 18:35
  • The presizing you do here is too small. It creates the HashSet with N *buckets*, not space for N *entries* at the default load factor. Thus, all of the benchmarks will end up unnecessarily reallocating and rehashing the table partway through. Using the constructor recommended by Klitos Kyriacou will create a right-sized table and avoid reallocation. – Stuart Marks Jan 12 '18 at 07:52
  • @StuartMarks I am not the one doing the presizing, that’s a copy paste of the code from The suggested constructor. – Robert D. Mogos Jan 12 '18 at 08:03
  • Out of the 200 or so milliseconds for each run, I wonder how much of this time is actually due to execution of a compiled version of the code. The time comprises: (first 10000(?) iterations of `add()` in interpreted byte code) + (time to compile the bytecode) + (remainder of the 10M iterations of `add()` executed in native code). I suspect a large chuck of that time might have been due to compilation. Try calling `f.run()` in a loop of 100 iterations and get the average. – Klitos Kyriacou Jan 12 '18 at 08:08
  • @KlitosKyriacou after running the average for 100 iterations each: – Robert D. Mogos Jan 12 '18 at 10:03
  • `Total execution time for: Normal loop : 170.06ms` – Robert D. Mogos Jan 12 '18 at 10:04
  • `Total execution time for: result.addAll(Arrays.asList(s)): : 214.74ms` – Robert D. Mogos Jan 12 '18 at 10:04
  • `Total execution time for: Collections.addAll(result, s); : 209.09ms` – Robert D. Mogos Jan 12 '18 at 10:05
  • `Total execution time for: new HashSet<>(Arrays.asList(s));: 216.82ms` – Robert D. Mogos Jan 12 '18 at 10:05
  • Interesting results. Looks like the explicit loop is fastest after all. – Klitos Kyriacou Jan 12 '18 at 10:28
  • The only difference between that and the body of Collections.addAll() is that the latter maintains a `result` boolean variable in the loop. So if you don't want to know whether your HashSet construction actually succeeded or not, and this is in a time-critical part of the code, perhaps the explicit loop might be a choice to consider. – Klitos Kyriacou Jan 12 '18 at 10:34
  • So perhaps a more efficient implementation of `Collections.addAll()` would have been to *not* maintain `result` inside the loop, but just call `add()` and throw away the return value; and then at the end return a boolean indicating if the increase in size of the collection equals the number of elements in the array. @StuartMarks fyi - I would be interested in your comments on this. – Klitos Kyriacou Jan 12 '18 at 10:36
  • That's why there's a difference between `Collections.addAll` and `results.addAll`. The first one is using `result |= c.add(element);` and the later the `if statement` – Robert D. Mogos Jan 12 '18 at 10:44
  • Thinking about the HashSet constructor argument, the value of array size is not too low *in this particular test case*. In fact, it's far too high! You are adding 10 million elements but they are random numbers between 0 and 99. Therefore, the vast majority are duplicates and the HashSet will end up with only 100 elements at most. – Klitos Kyriacou Jan 12 '18 at 11:25
  • Indeed. I thought the same thing and redid a test with random values between 0-10M. There was a small change – Robert D. Mogos Jan 12 '18 at 11:47
  • `Total execution time for: Normal loop : 2942.5ms` `Total execution time for: result.addAll(Arrays.asList(s)): : 3023.7ms` `Total execution time for: Collections.addAll(result, s); : 2879.5ms` `Total execution time for: new HashSet<>(Arrays.asList(s));: 3001.2ms` – Robert D. Mogos Jan 12 '18 at 13:14
  • @RobertD.Mogos Sorry, I should have been clearer. The presizing that divides by 0.75 is indeed from the constructor, and it sizes correctly. The presizing I meant that is too small is the code that does `new HashSet<>(s.length)` and then proceeds to add `s.length` presumably unique elements. This was originally from the OP, but if the benchmark is dong this, it's suboptimal, as it's also measuring reallocation. – Stuart Marks Jan 12 '18 at 17:52
  • @KlitosKyriacou I'm doubtful whether the looping technique makes a significant difference. The high variances in the initial benchmark runs listed in the answer tell me that it's hard to draw conclusions from any of the results. More precise benchmarking methodology is called for. – Stuart Marks Jan 12 '18 at 18:01
  • Thanks for the test. I decided to try JMH. Here is my first attempt: https://gitlab.com/gavvvr/array2set-jmh-benchmark/pipelines/16144739/builds I ran the pipeline job 5 times, so you can see 5 different results because of non-constant computing resources in the CI env. You can fork and play with it :) – Kirill Jan 14 '18 at 22:00