How to create a Scala parallel collection from a Java collection

Question

The easiest way to convert a Java Collection to a Scala equivalent is using JavaConversions, since Scala 2.8.. These implicit defs return wrappers for the contained Java Collection.

Scala 2.9 introduced parallel collections, where operations on a collection can be executed in parallel and the result collected later. This is easily implemented, converting an existing collection into a parallel one is as simple as:

myCollection.par

But there's a problem with using 'par' on collections converted from Java collections using JavaConversions. As described in Parallel Collection Conversions, inherently sequential collections are 'forced' into a new parallel collection by evaluating all of the values and adding them to the new parallel collection:

Other collections, such as lists, queues or streams, are inherently sequential in the sense that the elements must be accessed one after the other. These collections are converted to their parallel variants by copying the elements into a similar parallel collection. For example, a functional list is converted into a standard immutable parallel sequence, which is a parallel vector.

This causes problems when the original Java collection is intended to be lazily evaluated. For instance, if only a Java Iterable is returned, later converted to a Scala Iterable, there's no guarantee that the contents of the Iterable are intended to be accessed eagerly or not. So how should a parallel collection be created from a Java collection without sustaining the cost of evaluating each element? It is this cost I am trying to avoid by using a parallel collection to execute them in parallel and hopefully 'take' the first n results that are offered.

According to Parallel Collection Conversions there are a series of collection types that cost constant time, but there doesn't appear to be a way of getting a guarantee that these types can be created by JavaConversions (e.g. 'Set' can be created, but is that a 'HashSet'?).

Note that it would be a better idea to use JavaConverters rather than JavaConversions as the latter, with that in there you'd be able to do something like .asScala.toList.par. — Sean Parsons, Oct 12 '12 at 15:28

score 4 · Accepted Answer · answered Oct 12 '12 at 13:07

First, every collection obtained via JavaConversions from a Java collection is not a by-default parallelizable Scala collection - this means that it will always be reevaluated into its corresponding parallel collection implementation. The reason for this is that parallel execution relies on the concepts of Splitters at least - it has to be splittable into smaller subsets that different processors can then work on.

I don't know how your Java collection looks in the data-structure sense, but if it's a tree-like thing or an array underneath whose elements are evaluated lazily, chances are that you can easily implement a Splitter.

If you do not want to eagerly force a lazy collection that implements a Java collection API, then your only option is to implement a new type of a parallel collection for that particular lazy Java collection. In this new implementation you have to provide means of splitting the iterator (that is, a Splitter).

Once you implement this new parallel collection which knows how to split your data-structure, you should create a custom Scala wrapper for your specific Java collection (at this point it's just a little bit of extra boilerplate, see how it's done in JavaConversions) and override its par to return your specific parallel collection.

You might even be able to do this generically for indexed sequences. Given that your Java collection is a sequence (in Java, a List) with a particularly efficient get method, you could implement the Splitter as an iterator that calls get within the initial range from 0 to size - 1, and is split by subdividing this range.

If you do, patches to the Standard library are always welcome.

score 1 · Answer 2 · answered Oct 12 '12 at 13:07

Parallel requires random access and java.lang.Iterable doesn't provide it. This is a fundamental mismatch that no amount of conversions will comfortably get you past.

To use a non-programming analogy, you cannot get a person from Australia to England by sending one person from Singapore to England and another from Australia to Singapore at the same time.

Or in programming if you're processing a live stream of data you cannot parallelise it by processing the data from now at the same time as the data from five minutes ago without adding latency.

You would need something that provides at least some random access, like java.util.List.listIterator(Int) instead of Iterable.

I guess I assumed each call to retrieve the next element (i.e. Iterable.iterator().next()) was run inside a thread. — Dan Gravell, Oct 12 '12 at 13:16

How to create a Scala parallel collection from a Java collection

2 Answers2