
What is the best case and worst case time complexity for my method below?

I know ArrayList.add() is O(1), but I'm not sure about loaded.stream().distinct().collect(Collectors.toList());

  public static int countUnique(WordStream words) {

    ArrayList<String> loaded = new ArrayList<String>();
    ArrayList<String> empty = new ArrayList<String>();

    // Fill loaded with WordStream words
    for (String i : words) {
      loaded.add(i);
    }

    empty = (ArrayList<String>) loaded.stream().distinct().collect(Collectors.toList());

    return empty.size();
  }
Paul R
Iona

2 Answers


Firstly, ArrayList isn't O(1) for all operations, but add() is (amortized) O(1).

distinct() is O(n), because it must examine every element. Each element check is O(1), because distinct() is backed by a HashSet, whose add/contains operations are O(1) on average.

You could replace your code with:

return (int)loaded.parallelStream().distinct().count();

which would likely be faster, but is still O(n).
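As a quick sanity check that the stream pipeline and a plain HashSet agree, here is a minimal runnable sketch (a plain List stands in for the WordStream from the question, since its definition isn't shown):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class DistinctDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");

        // Stream-based count of distinct elements: O(n) overall,
        // since distinct() hashes each element once.
        long viaStream = words.stream().distinct().count();

        // Equivalent HashSet-based count, also O(n).
        long viaSet = new HashSet<>(words).size();

        System.out.println(viaStream); // 3
        System.out.println(viaSet);    // 3
    }
}
```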

Bohemian
    `ArrayList` is *not* `O(1)` for `add()`. The [documentation](http://docs.oracle.com/javase/8/docs/api/?java/util/ArrayList.html) specifies `add()` as having “amortized constant time” explained as “adding n elements requires `O(n)` time”, which is what `O(n)` is all about as we only look at the complexity of the entire task consisting of *n* elements. By the way, I have strong doubts about the benefit of `.parallelStream()` here. The minimum you should do is `.parallelStream().unordered()` (in theory `count()` implies unordered but to my experience, you better don’t trust the implementation) – Holger Mar 16 '16 at 13:46
  • In my tests, `.parallelStream().unordered().distinct().count()` was significantly faster than `.parallelStream().distinct().count()` but rarely reached the performance of the sequential `.stream().distinct().count()`… – Holger Mar 16 '16 at 13:51
  • @Holger Interesting. I just ran tests (with warmup) of the 4 combos of `parallelStream()` and without or without `unordered()` on lists sizes powers of ten up to 1M: Surprisingly, `.stream().unordered().distinct().count()` was consistently and noticeably faster. Surprising because I would have assumed that the stream of a list was ordered and couldn't be unordered and/or that `parallelStream()` should be faster given sufficient stream size. – Bohemian Mar 16 '16 at 18:35
  • The ordered property is a feature for your sake so you can get, e.g. a resulting list in the right order when collecting the final result into a list. If you don’t care about the order, you can always release that contract to allow potential performance increase, see http://stackoverflow.com/a/29218074. In case of `distinct()` it may drive the decision to use a `LinkedHashSet` or a `HashSet`, which may affect the performance, though in my environment, the effect on a sequential stream was little to not noticeable. To me it’s surprising that `count()` doesn’t automatically imply `.unordered()` – Holger Mar 16 '16 at 18:49
  • It’s a common wrong assumption that there is always a size at which `parallelStream()` will be faster. Some problems simply aren’t good parallelizable. Parallel `distinct()` implies each thread collects into its own set before they get merged, basically repeating the distinct operation for the remaining elements. Depending on the element distribution that can be much more expensive than a single sequential run. It *can* have a benefit if there are lots of duplicates, compared to the number of remaining elements. To judge that, you need to predict the result beforehand… – Holger Mar 16 '16 at 18:54
  • @holger in my experience, parallel is always more expensive than single threaded for small streams with cheap processing needs, as expected since thread scheduling and swapping etc is expensive. – Bohemian Mar 16 '16 at 21:47
  • I think most people understood that parallel doesn’t pay off for small streams; the tougher mind job is understanding that for some tasks it will never pay off, even for the largest streams. It’s not only the threading overhead, you often need a different (more expensive) algorithm for parallel execution or, as shown with the `distinct` example, the local results have to be expensively merged afterwards. Also, the more parts of the operation the hot spot optimizer can move out of the loop, thus not depending on the number of elements anymore, the lesser benefit of parallel execution… – Holger Mar 17 '16 at 09:19
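The four pipeline variants being benchmarked in the comments above can be written out as follows (a sketch using a plain List; as the comments show, the relative timings depend heavily on the data and environment, so no timing code is included here):

```java
import java.util.Arrays;
import java.util.List;

public class UnorderedDemo {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");

        // All four pipelines compute the same count; only the
        // ordered/unordered and sequential/parallel properties differ,
        // which is what drives the performance differences discussed above.
        long a = words.stream().distinct().count();
        long b = words.stream().unordered().distinct().count();
        long c = words.parallelStream().distinct().count();
        long d = words.parallelStream().unordered().distinct().count();

        System.out.println(a + " " + b + " " + c + " " + d); // 3 3 3 3
    }
}
```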

You could implement this much more concisely by not using streams:

HashSet<String> loaded = new HashSet<>();
for (String i : words) {
  loaded.add(i);
}
return loaded.size();

You wouldn't get much benefit from a parallel stream here, since the loop that fills the list already executes serially.

This approach would be O(n) also.


As noted by @Holger, if WordStream is a Collection (rather than a mere Iterable), it could be implemented even more concisely:

return new HashSet<>(words).size();

However, it is not specified in the question whether WordStream is actually a Collection.
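For illustration, here is a self-contained sketch of the HashSet approach under the assumption that WordStream is merely an Iterable<String> (the WordStream interface and the sample words below are stand-ins; the question doesn't show the actual definition):

```java
import java.util.Arrays;
import java.util.HashSet;

public class CountUniqueDemo {
    // Hypothetical stand-in: the for-each loop in the question only
    // tells us that WordStream must be an Iterable<String>.
    interface WordStream extends Iterable<String> {}

    static int countUnique(WordStream words) {
        HashSet<String> seen = new HashSet<>();
        for (String w : words) {
            seen.add(w); // add() is amortized O(1) for HashSet
        }
        return seen.size();
    }

    public static void main(String[] args) {
        WordStream ws = () -> Arrays.asList("the", "cat", "the", "dog").iterator();
        System.out.println(countUnique(ws)); // 3
    }
}
```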

Andy Turner
    Yeah, and in case `WordStream` is also a `Collection` (at least we know from the code that it must be an `Iterable`), it could be simplified as `return new HashSet<>(words).size();`… – Holger Mar 16 '16 at 13:25
  • @Holger I'd wanted to write it like that too. However, I suspected that it would more likely be a plain `Iterable` as it is called a "stream", so just kept it like this. – Andy Turner Mar 16 '16 at 13:27