8

I just started learning Scala, so please be patient :-)

I have a question about how reduceLeft behaves. Here an example:

List(1, 2, 3, 4, 5) reduceLeft (_ + _)

I wonder if the calculation can be done simultanously, e.g.:

first round:

  • process 1 calculates: 1 + 2
  • process 2 calculates: 4 + 5

second round:

  • process 1 calculates: 3 + 3

third round:

  • process 1 calculates: 6 + 9

At least that's what I would expect to happen if I just use the reduce function instead of reduceLeft. Or does reduceLeft really only does one reduction at a time?

((((1 + 2) + 3) + 4) + 5)

This would basically mean it can't be executed in parallel and one should always prefer reduce over reduceLeft/Right if possible?

om-nom-nom
  • 62,329
  • 13
  • 183
  • 228
Sebi
  • 8,323
  • 6
  • 48
  • 76
  • 2
    Not exactly. `reduceLEFT` is by definition sequential (proceeding from left to right). That's why the parallel versions of the `blahLeft` and `blahRight` HOFs are named simply `blah`. – Randall Schulz Mar 08 '13 at 02:54

2 Answers2

11

The answer is yes, and it is very easy:

List(1, 2, 3, 4, 5).par.reduce (_ + _)

The par method turns the list into a parallel collection. When you call reduce on this parallel collection, it will be executed in parallel.

See the parallel collection documentation

Régis Jean-Gilles
  • 32,541
  • 5
  • 83
  • 97
  • 2
    I have no idea who downvoted this--the answer is short, to the point, and completely correct. – Rex Kerr Mar 07 '13 at 23:00
  • Actually I thought this was you, as your previous answer mentioned that `reduce` was not parallelizable ^^. After checking in the REPL that it is indeed executed on several threads, I was going to reply, only to see that you have updated that statement, and the downvote is gone (or balanced by an upvote, I can't tell)... – Régis Jean-Gilles Mar 07 '13 at 23:02
  • 1
    Balanced by my upvote, because your answer is at least as good if not better than mine (and my initial one was not phrased correctly). – Rex Kerr Mar 07 '13 at 23:04
  • I no longer remember what I wrote--did I actually say _reduce_ wasn't parallelizable (false) or that _reduceLeft_ wasn't (true)? I did jump from reduce to fold and start talking about aggregate without really explaining what was going on. – Rex Kerr Mar 07 '13 at 23:06
  • I was pretty sure that you said that the parallel analog to `reduceLeft` was not `reduce`, but `aggregate`. But I can't find this anymore in your post's history?!? Maybe I am too tired, sorry for that ^^. In any case, who cares now, your answer as it stands is nice and comprehensive. – Régis Jean-Gilles Mar 07 '13 at 23:18
  • I forgot to mention that I was generalizing to include fold as well as reduce, which I fixed on edit. – Rex Kerr Mar 08 '13 at 00:05
  • Wow, I wasn't aware that I need to explicitly turn the collection into a parallel one. I guess a lot of students will think their code is multi-threaded while they are missing the `par` statement. I would expect that parallel versions are always preferred where possible. So I think all Scala books should be updated to include a big red warning to all new students about it :-) – Sebi Mar 08 '13 at 06:40
  • When you parallelize simple collection operations on small collections, your speed-up is likely to be quite rubbish. Typically, if one cpu core gives 100%, don't expect 400% from four cores. Don't even expect 120%, although you might get lucky. The reason is described by Amdahl's Law, which is loosely that the maximum speed-up is inversely proportional to the fraction of the code that gets parallelised. Or, more simply, the overheads get in the way. If you want highly efficient parallelism, parallel collections are usually the last choice, not the first. They are *easy* though. – Rick-777 Mar 09 '13 at 11:40
  • I would not go as far as to say they are the last choice. If the amount of data you are working on is substantial, you should actually get a nice speedup for a very little amount of work and obfuscation. It does not seem too bad for me. Just be aware that parallel collections are no fit for small data sets. – Régis Jean-Gilles Mar 09 '13 at 14:48
6

As you've noticed, reduceLeft is not parallelizable, as it explicitly assumes a form that is not associative: (B,A) => B.

As long as you use an associative operator, reduce is parallelizable.

There's also an analog of foldLeft called aggregate that takes two functions: one to map into a combinable form, and two an associative one to merge the elements: (B,A)=>B, (B,B) => B.

This one, as long as the two functions will agree on the output, and you have a zero to mix in wherever you want, is parallelizable.

So if you want to be able to be parallel,

reduceLeft/Right ->  reduce
foldLeft/Right   ->  aggregate

There may be some cases where reduce is more restrictive than reduceLeft but aggregate will do the trick.

That said, this only makes the statement able to be parallel. For it to actually be parallel you need to use a collection that inherits from ParIterable, and these all have Par in their names: ParVector, etc.. The easiest way to get a parallel collection is to call .par on a regular one (.seq goes the other way, from parallel to non-parallel). It's done this way because in general there's no reason to be parallel except for speed, but parallelism adds overhead. So you should only operate in parallel if there's enough work to do, and while you may know that, the compiler probably doesn't. Thus, you are to explicitly select which kind of collection you want. (Parallel collections return parallel, and sequential return sequential.)

Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • It would be great if you can add whether `reduce`/`aggregate` are automatically parallelized for every collection (this is how I would currently interpret your answer) or whether this only applies to parallel collections obtained via `par` (this is what I guess). – bluenote10 Mar 08 '13 at 09:20