6

Let's run the following line of code several times:

Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)

The results are quite interesting:

scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res10: Int = 8
scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res11: Int = 20

However clearly it should be like in its sequential version:

scala> Set(1,2,3,4,5,6,7).fold(0)(_ - _)
res15: Int = -28

I understand that operation - is non-associative on integers and that's the reason behind such behavior, but my question is quite simple: doesn't it mean that fold should not be parallelized in .par implementation of collections?

tkachuko
  • 1,956
  • 1
  • 13
  • 20

3 Answers3

8

When you look at the standard library documentation, you see that fold is undeterministic here:

Folds the elements of this sequence using the specified associative binary operator. The order in which operations are performed on elements is unspecified and may be nondeterministic.

As an alternative, there's foldLeft:

Applies a binary operator to a start value and all elements of this sequence, going left to right. Applies a binary operator to a start value and all elements of this collection or iterator, going left to right.

Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.

As Set is not an ordered collection, there's no canonical order in which the elements could be folded, so the standard library allows itself to be undeterministic even for foldLeft. If you would use an ordered sequence here, foldLeft would be deterministic in that case.

Fabian Schmitthenner
  • 1,696
  • 10
  • 22
  • Thanks a lot for your reply. Postponed doc reading as usual and was acting based on intuition :) – tkachuko Aug 03 '16 at 04:09
  • 1
    The scaladoc is a lie. `SeqLike#fold` actually uses `foldLeft`: https://github.com/scala/scala/blob/1706a37eb84ec252aea77bccebad3e48448534ad/src/library/scala/collection/TraversableOnce.scala#L212 – Michael Zajac Aug 03 '16 at 04:11
  • 2
    But it does not mean it is the same for par. They are actually implemented independently in parallel collections. – tkachuko Aug 03 '16 at 04:12
  • 1
    That's not a lie because `SeqLike` is a trait and this is only the default implementation that can be overwritten later (because it's not `final`) – Fabian Schmitthenner Aug 03 '16 at 04:13
  • It isn't the same for parallel collections. I'm pointing out that it _is_ the same for non-parallel collections that extend `TraversableOnce` (pretty much everything). – Michael Zajac Aug 03 '16 at 04:14
  • `SeqLike` inherits `fold` directly from `TraversableOnce`, so yes, it is a scaladoc copy failure in that case. There are many scaladoc errors like this throughout the standard library. – Michael Zajac Aug 03 '16 at 04:15
  • hm, you're right; haven't read enough api etheir. also, `ParSeq` and `Seq` have no dependency, so my api reference is also wrong – Fabian Schmitthenner Aug 03 '16 at 04:21
  • the difference between deterministic/nondeterministic should still be valid for ordered ParSeq's though – Fabian Schmitthenner Aug 03 '16 at 04:23
  • 1
    hm, I don't think there's a _copy_ failure in the scaladoc, it's just staying the same as the `SeqLike` trait extends `TraversableOnce`. And this also explicitly allows custom implementations to overwrite the default behaviour and do something more efficient – Fabian Schmitthenner Aug 03 '16 at 04:40
  • will check the scala book today about this – Fabian Schmitthenner Aug 03 '16 at 04:44
4

The scaladoc does say:

The order in which the elements are reduced is unspecified and may be nondeterministic.

So, as you stated, a binary operation applied in ParSet#fold that is not associative is not guaranteed to produce a deterministic result. The above text is warning is all you get.

Does that mean ParSet#fold (and cousins) should not be parallelized? Not exactly. If your binary operation is commutative and you don't care about non-determinism of side-effects (not that a fold should have any), then there isn't a problem. However, you are hit with the caveat of needing to tread carefully around parallel collections.

Whether or not it is correct is more of a matter of opinion. One could argue that if a method can result in accidental non-determinism, that it should not exist in a language or library. But the alternative is to clip out functionality so that ParSet is missing functionality that is present in most of the other collection implementations. You could use the same line of thinking to also suggest the removal of Stream#foreach to prevent people from accidentally triggering infinite loops on infinite streams, but should you?

Community
  • 1
  • 1
Michael Zajac
  • 55,144
  • 7
  • 113
  • 138
  • 1
    So we are talking about balance between reading documentation and having deterministic methods. Well, still debatable, but example with the Stream is nice. Thanks for answer. – tkachuko Aug 03 '16 at 04:14
1

It is useful to parallelize fold operation with high workloads, however, to guarantee a deterministic output from calling of collection.par.fold(z)(f), the following conditions must hold:
1- f(f(a,b),c) == f(a,f(b,c)) // Associativity
2- f(z,a) == f(a,z) == a , where z is the neutral element for f (like 0 for sum, and 1 for multiplication).

Fabian's answer suggests using foldLeft instead. Although this is deterministic, using .par with it won't really parallelize anything. because foldLeft is sequential by nature.