Must access to scala.collection.immutable.List and Vector be synchronized?

Question

I'm going through Learning Concurrent Programming in Scala, and encountered the following:

In current versions of Scala, however, certain collections that are deemed immutable, such as List and Vector, cannot be shared without synchronization. Although their external API does not allow you to modify them, they contain non-final fields.

Tip: Even if an object seems immutable, always use proper synchronization to share any object between the threads.

From Learning Concurrent Programming in Scala by Aleksandar Prokopec, end of Chapter 2 (p.58), Packt Publishing, Nov 2014.

Can that be right?

My working assumption has always been that any internal mutability (to implement laziness, caching, whatever) in Scala library data structures described as immutable would be idempotent, such that the worst that might happen in a bad race is work would be unnecessarily duplicated. This author seems to suggest correctness may be imperiled by concurrent access to immutable structures. Is that true? Do we really need to synchronize access to Lists?

Much of my transition to an immutable-heavy style has been motivated by a desire to avoid synchronization and the potential contention overhead it entails. It would be an unhappy big deal to learn that synchronization cannot be eschewed for Scala's core "immutable" data structures. Is this author simply overconservative?

Scala's documentation of collections includes the following:

A collection in package scala.collection.immutable is guaranteed to be immutable for everyone. Such a collection will never change after it is created. Therefore, you can rely on the fact that accessing the same collection value repeatedly at different points in time will always yield a collection with the same elements.

That doesn't quite say that they are safe for concurrent access by multiple threads. Does anyone know of an authoritative statement that they are (or aren't)?

so far I've found - http://www.scala-lang.org/old/node/7385 – dk14 Apr 18 '15 at 07:54 — dk14, Apr 18 '15 at 07:54
@dk14 thanks! so others have wondered and discussed... – Steve Waldman Apr 18 '15 at 08:33 — Steve Waldman, Apr 18 '15 at 08:33

score 7 · Answer 1 · edited May 23 '17 at 11:53

7

It depends on where you share them:

it's not safe to share them inside scala-library
it's not safe to share them with Java-code, reflection

Simply saying, these collections are less protected than objects with only final fields. Regardless that they're same on JVM level (without optimization like ldc) - both may be fields with some mutable address, so you can change them with putfield bytecode command. Anyway, var is still less protected by the compiler, in comparision with java's final, scala's final val and val.

However, it's still fine to use them in most cases as their behaviour is logically immutable - all mutable operations are encapsulated (for Scala-code). Let's look at the Vector. It requires mutable fields to implement appending algorithm:

private var dirty = false

//from VectorPointer
private[immutable] var depth: Int = _
private[immutable] var display0: Array[AnyRef] = _
private[immutable] var display1: Array[AnyRef] = _
private[immutable] var display2: Array[AnyRef] = _
private[immutable] var display3: Array[AnyRef] = _
private[immutable] var display4: Array[AnyRef] = _
private[immutable] var display5: Array[AnyRef] = _

which is implemented like:

val s = new Vector(startIndex, endIndex + 1, blockIndex)
s.initFrom(this) //uses displayN and depth
s.gotoPos(startIndex, startIndex ^ focus) //uses displayN
s.gotoPosWritable //uses dirty
...
s.dirty = dirty

And s comes to the user only after method returned it. So it's not even concern of happens-before guarantees - all mutable operations are performed in the same thread (thread where you call :+, +: or updated), it's just kind of initialization. The only problem here is that private[somePackage] is accessible directly from Java code and from scala-library itself, so if you pass it to some Java's method it could modify them.

I don't think you should worry about thread-safety of let's say cons operator. It also has mutable fields:

final case class ::[B](override val head: B, private[scala] var tl: List[B]) extends List[B] {
  override def tail : List[B] = tl
  override def isEmpty: Boolean = false
}

But they used only inside library methods (inside one-thread) without any explicit sharing or thread creation, and they always return a new collection, let's consider take as an example:

override def take(n: Int): List[A] = if (isEmpty || n <= 0) Nil else {

    val h = new ::(head, Nil)
    var t = h
    var rest = tail
    var i = 1
    while ({if (rest.isEmpty) return this; i < n}) {
      i += 1
      val nx = new ::(rest.head, Nil)
      t.tl = nx //here is mutation of t's filed 
      t = nx
      rest = rest.tail
    }
    h
}

So here t.tl = nx is not much differ from t = nx in the meaning of thread-safety. They both are reffered only from the single stack (take's stack). Althrought, if I add let's say someActor ! t (or any other async operation), someField = t or someFunctionWithExternalSideEffect(t) right inside the while loop - I could break this contract.

A little addtion here about relations with JSR-133:

1) new ::(head, Nil) creates new object in the heap and puts its address (lets say 0x100500) into the stack(val h =)

2) as long as this address is in the stack, it's known only to the current thread

3) Other threads could be involved only after sharing this address by putting it into some field; in case of take it has to flush any caches (to restore the stack and registers) before calling areturn (return h), so returned object will be consistent.

So all operations on 0x100500's object are out of scope of JSR-133 as long as 0x100500 is a part of stack only (not heap, not other's stacks). However, some fields of 0x100500's object may point to some shared objects (which might be in scope JSR-133), but it's not the case here (as these objects are immutable for outside).

I think (hope) the author meant logical synchronization guarantees for library's developers - you still need to be careful with these things if you're developing scala-library, as these vars are private[scala], private[immutable] so, it's possible to write some code to mutate them from different threads. From scala-library developer's perspective, it usually means that all mutations on single instance should be applied in single thread and only on collection that invisible to a user (at the moment). Or, simply saying - don't open mutable fields for outer users in any way.

P.S. Scala had several unexpected issues with synchronization, which caused some parts of the library to be surprisely not thread-safe, so I wouldn't wonder if something may be wrong (and this is a bug then), but in let's say 99% cases for 99% methods immutable collections are thread safe. In worst case you might be pushed from usage of some broken method or just (it might be not just "just" for some cases) need to clone the collection for every thread.

Anyway, immutability is still a good way for thread-safety.

P.S.2 Exotic case which might break immutable collections' thread-safety is using reflection to access their non-final fields.

A little addition about another exotic but really terrifying way, as it pointed out in comments with @Steve Waldman and @axel22 (the author). If you share immutable collection as member of some object shared netween threads && if collection's constructor becomes physically (by JIT) inlined (it's not logically inlined by default) && if your JIT-implementation allows to rearrange inlined code with normal one - then you have to synchronize it (usually is enough to have @volatile). However, IMHO, I don't believe that last condition is a correct behaviour - but for now, can't neither prove nor disprove that.

edited May 23 '17 at 11:53

Community

1
1

answered Apr 18 '15 at 08:33

dk14

22,206
4
51
88

Thanks for a very thoughtful response! As a practical matter, your suggestion is reasonable, that many structures are immutable from the perspective of clients outside of a package or implementing scope, but can be mutated by code within that scope, so care must be taken to ensure single-threaded access when working from within. Still, I think the author maybe formally right. Suppose I call take on a list + set the return value to a var field. Does anything in the Java memory model guarantees a second thread will see the completed `h` rather than some intermediate value absent synchronization? – Steve Waldman Apr 18 '15 at 11:38
When you assign it to the `var`, you assign the address (like #15 in bytecode), but not the value. So another thread may see either previous address stored to the var or new address stored to the var. So any misssynchronization can lead only to the whole value to corrupt - but not the part of. As new address will become available for your function only after new collection is fully complete (after `areturn` command of `take` is executed). – dk14 Apr 18 '15 at 11:45
Or "all mutable operations happen before external field's update", which is obvious. I think the author means that immutable API doesn't gurantee you that there is no mutable parts inside,which can cause unexpected side-effects. Scala Collections are not so good but legit example as introducing `var`s there is a significant risk for scala's developement itself. From FP's perspective that means - you should always be sure that function you call is pure. – dk14 Apr 18 '15 at 11:50
Re the all-or-nothingness of modifications since an address is returned, I don't think that's formally right. (I suspect/hope it is right in practice though!) See e.g. https://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html Even nonfinal fields set in a ctor are not guaranteed consistent across threads. Even though an address is returned, different values may result from dereferencing the address, as threads may use values from different caches rather than main memory. Absent synchronization, few guarantees are made about how/when caches are reconciled. Plus reordering, and inlining. – Steve Waldman Apr 18 '15 at 12:10
If you concern about JMM, here is a quotation: The semantics of other operations, such as reads of array lengths, executions of checked casts, and **invocations of virtual methods**, are not directly affected by data races. The address is returned from virtual method call (`take`), so it's not consern of JSR-133 at all. Compiler can't reorder commands between methods. – dk14 Apr 18 '15 at 12:13
It seems like the scala libraries ought to be internally synchronizing access to the fields that they mutate (Sometimes it would be sufficient to declare them @volatile). Or, as you suggest, they should skip the clever optimizations and be pure — immutables should really be that, with final fields set in constructors. It's nice, of course, to have clever optimizations and no synchronization overhead or contention. But I think it's not correct, they are relying on behavior-in-practice that is not guaranteed. – Steve Waldman Apr 18 '15 at 12:15
Don't forget that `take` is returned to the same thread, so no field can be cached by other thread before address is returned. Everything is happening in one single thread. So the coherent result is guaranteed unless someone will pass h to some external variable or other thread. – dk14 Apr 18 '15 at 12:16
Or more simply saying, `h` isn't concern of JMM as long as it's value in stack, but not object's field. Don't think about its internal structure - just think about `h` :) – dk14 Apr 18 '15 at 12:22
Oh, good find re "invocations of virtual methods"! I'm not sure, [in context](http://www.cs.umd.edu/~pugh/java/memoryModel/jsr133.pdf), this is quite as strong a guarantee as you suggest. It disallows segfaults due to misread addresses, but does it guarantee the consistency of the contents of what's addressed? But let's suppose (I hope!) you are right. Note the subtle possible bug, then, if the method or class is declared final, enabling inlining at runtime. (An inversion of programmer intuitions, if declaring a method or class final renders a program less rather than more safe + predictable!) – Steve Waldman Apr 18 '15 at 12:26
Just think simpler. When you do `a = list.take(5)`, when a is some shared field (with adress #15), you actually do 1) calling `invokevirtual` and blocking until it returns 3) methods putting data into accumulator 4) you're doing putfield #15, which puts accumulator into #15. So only #15 is concern of JSR-133 here. Compiler may optimize whatever it want's even remove the #15, but value in accumulator is always fine, as no other thread did touch it. **No thread did touch new list before it's putten into accumulator** because no one knows about this newborn list before. – dk14 Apr 18 '15 at 12:31
**Any inlines won't affect it as long as `h` is mutated as part of the stack only** No compiler will optimize internal variable (part of stack) into the field (part of heap). Only the reverse is possible. – dk14 Apr 18 '15 at 12:37
and by `h` I mean the address of newely created object in the heap. The address itself is in stack during whole `take` execution (even if it's inlined), so no other thread can mutate this address as there is no way to get it unitl you assign it to the field. – dk14 Apr 18 '15 at 12:49
The issue here isn't other threads mutating. It's the possibility that other threads might read stale or inconsistent values. The item that is mutated lives on the heap. `h` on the stack is perfectly private and perfectly fine, sure. Stack variables always are. Memory inconsistencies come from inconsistencies in shared memory, necessarily different views of the heap. Nothing prevents the mutating thread from, say, allocating the object pointed to by `h` in main memory, maintaining a cache-local synthetic register to represent `h.tl` and mutate locally until forced to sync. – Steve Waldman Apr 18 '15 at 13:17
That would certainly be a legal optimization, whether or not it is a common one. Even if your interpretation of the virtual method guarantee is right, an inlined version of the method could set a widely accessible field to the address at `h`, which would be a perfectly correct address pointing to an object whose state, alas, happens to be stale, until the Thread executing `take(...)` decides at some undefined time to synchronize the heap object's field to the latest version of its cache-local register. – Steve Waldman Apr 18 '15 at 13:22
for non-inline - `areturn` will be synchronization event (any cache will be flushed before stack pointer moves back). – dk14 Apr 18 '15 at 13:25
Inline isn't case for scala library, as it's a public method - it can't be inlined – dk14 Apr 18 '15 at 13:37
Even if imagine, JSR-133 doesn't tell anything about inlining, but I suppose, that compiler should not optimize inlined code, so cache should be flushed (after returning from inline) anyway – dk14 Apr 18 '15 at 13:39
perhaps so! I'm a bit less certain of that because of [this](http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html): "If an object is properly constructed (which means that references to it do not escape during construction), then all threads which see a reference to that object will also see the values for its final fields that were set in the constructor, without the need for synchronization." That seems to suggest that even for constructors which have returned, _nonfinal_ fields may remain inconsistent without synchronization. – Steve Waldman Apr 18 '15 at 13:40
(I wouldn't rely on compilers / hotspot JVMs not optimizing inlined code unless they are forbidden to do so. That they are inlining at all is because they are optimizing. If JMM wanted to impose the semantics of a restriction in all circumstances, it might have not qualified "invocations of methods" with the word "virtual".) – Steve Waldman Apr 18 '15 at 13:45
I think they might see the reference to the allocated memory before the constructor is finished. – dk14 Apr 18 '15 at 13:46
JVM can't optimize inline code as it works with bytecode, itcan't just remove areturn instruction – dk14 Apr 18 '15 at 13:47
about scala - http://stackoverflow.com/questions/4593710/when-should-i-and-should-i-not-use-scalas-inline-annotation – dk14 Apr 18 '15 at 13:48
inlining code between `jar`s is obviously impossible, and scala inlines only explicitly annotated code (at least for public methods) – dk14 Apr 18 '15 at 13:49
[Hotspot JVMs](https://wikis.oracle.com/display/HotSpotInternals/PerformanceTechniques) certainly do inline code, and perform a lot of other optimizations not visible in bytecode! – Steve Waldman Apr 18 '15 at 13:52
Regardless of scala hints about inlining and what a scala compiler does, performant JVMs will do a lot of reordering and rearranging, including simple inlining of nonvirtual methods and conditional inlining of virtual methods. The Java concurrency is complicated mostly because bytecode presents an unreliably simplified version of code execution. – Steve Waldman Apr 18 '15 at 13:56
I'd never want code to be inlined across classes in bytecode, that's a maintenance nightmare, you modify your code but dependent libraries don't see the changes. A good compiler will inline private or final methods within a top-level class, I think (with no need of annotations or hints), but mostly inlining is a runtime (or JIT-compile-time) phenomenon. – Steve Waldman Apr 18 '15 at 13:59
JIT-optimizers still stricted by JSR-133 guarantees, so it's not inlining we're talking about - they copy the whole bytecode, but can't rearange operations on same register and/ or stack (as they still have to emulate `invokevirtual`). I'm not sure that even scala's compiler can do that. – dk14 Apr 18 '15 at 14:01
JSR-133 very much permits reordering, and there's no hint that returning from nonvirtual (final, private, or static) methods should force a synchronization, so reordering should be permitted of stacks that contain such inlined methods. – Steve Waldman Apr 18 '15 at 14:06
the byte-code which goes to Hotspot JVM has `invokevirtual` command anyway. `take` is definetly a virtual method. Hotspot JVM can't change the logic `flush cache before assigning to the field` logic, as it doesn't even know what is cache what is not cache here. – dk14 Apr 18 '15 at 14:07
Yes, if your interpretation of that sentence in JSR-133 is correct and the JVM is required to synchronize caches to main memory on method return, then the Hotspot JVM would certainly have to honor that behavior. I'm not sure your interpretation is right, but I hope so! Hotspot JVMs do mess around with virtual method invocation, optimizing for guesses about the objects that will be targets, but that is conditional, they always have to respect code semantics. – Steve Waldman Apr 18 '15 at 14:17
I also hope :). I've written a letter to Aleksandar Prokopec, maybe he would clear this out :). Have to go now, so let's wait for some other opinions. – dk14 Apr 18 '15 at 14:19
The JVM certainly does know, however, if it has optimized mutations to a cache-local register, and if your interpretation of JSR-133 is not correct (again, I hope it is!), the VM certainly could choose not to sync its local register to the main-memory object very promptly. – Steve Waldman Apr 18 '15 at 14:19
Thanks for the very thoughtful response and conversation! – Steve Waldman Apr 18 '15 at 14:20
Sorry that I'm arriving late to the party. – axel22 Aug 07 '15 at 20:43
I believe that this comment (by Steve Waldman) captures the gist of what I meant in the book: ---------------- Re the all-or-nothingness of modifications since an address is returned, I don't think that's formally right. (I suspect/hope it is right in practice though!) See e.g. cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html Even nonfinal fields set in a ctor are not guaranteed consistent across threads. Even though an address is returned, different values may result from dereferencing the address, ... ---------------- – axel22 Aug 07 '15 at 20:44
So, while for one thread an immutable collection always looks like an immutable collection, concurrent threads should not rely on unsafe publication to share immutable collections. – axel22 Aug 07 '15 at 20:44
For example, if thread T1 creates a List(1, 2, 3), it should not assign that list object to a field in some object: `var x: List[Int] = null` – axel22 Aug 07 '15 at 20:44
By doing so, the thread risks that the JIT compiler reorders instructions in the code so that the address of the list is assigned before the list is actually initialized. Another thread T2 can read a non-null value from the field `x` and see the incomplete list. The solution is to use the synchronized statement or volatiles: `@volatile var x: List[Int] = null` – axel22 Aug 07 '15 at 20:44
The comment that says that the JIT compiler would not reorder contents of virtual methods could mean that in practice assigning to a freshly constructed list to a non-volatile `var` might not cause data races like the one in my example. However, as a scala-library user, this is something that you should not rely on when writing correct concurrent programs - `scalac` optimizer could potentially inline those methods, or they might be replaced with non-virtual method calls. – axel22 Aug 07 '15 at 20:50
@axel22 could you please point us to some docs, that are saying that content of inlined method could be rearranged with content of outside one, as jsr-133 isn't much specific about this (I believe it describes only guarantees and optimizations about code inside of method only, not between them) – dk14 Aug 07 '15 at 21:09
I'm not aware of such docs, because, to the best of my knowledge, `scalac` optimizer does not at the moment inline the body of the `List.apply` constructor, in which case all code related to the expression `x = List(1, 2, 3)` could become entirely contained within a single method. However, there is no guarantee that `scalac` won't start doing this in the future. Additionally, there is no guarantee that the `@inline` method won't be added to `List.apply`, in which case, there are some docs about how inlining works here: http://magarciaepfl.github.io/scala/#WhatsNew – axel22 Aug 07 '15 at 21:22
@axel22 I actually meant a bit different thing. It's fine to inline methods (even between different modules), but I don't think that it's fine to rearrange normal code with inlined one after that. Disprooving of my second assumption is actually what I'm asking for. – dk14 Aug 07 '15 at 21:25
After the code has been inlined by the e.g. `scalac` compiler, it ends up being just a sequence of bytecode instructions. This means that after the inlining transformation, there is normally not many ways to tell the inlined code apart from the original code in the method. After the program is started, there is no guarantee that the JVM's JIT compiler won't rearrange individual instructions - from JIT's perspective, a method containing code that was inlined by `scalac`, is exactly the same as the normally written code. – axel22 Aug 07 '15 at 21:37
@axel22 Sorry, but I meant JIT's inlining (let's say physical one, not logical), as this is a dangerous moment here. Your previous link (thanks, btw) is pointing out that `scalac` is going to "only inline @inline-marked methods, and always inline them, including under separate-compilation". So `List.apply` etc should never be logically inlined in scala-library and I hope it won't for reasons you've mentioned. So, I still appelate that physical inlining shouldn't (must not) reoptimize code. – dk14 Aug 07 '15 at 21:54
You're welcome! :) I agree - JIT's inlining is a different thing. Still, I wouldn't count on the absence of `@inline` on `List` in stdlib would resolve the problem. The main problem with `List` is that it has a non-final field. This allows third-party (non-stdlib) code to also declare `@inline` methods that construct and return lists. If you call such a method and assign it to a field `x`, you could theoretically get a data race. I cannot say that I reproduced this, though. – axel22 Aug 07 '15 at 22:12
1

@axel22 I still believe you can't (as constructor itself won't be inlined). However, it's your birthday and I'm coincendantally drunk, so won't bother you anymore :). – dk14 Aug 07 '15 at 22:17
But the List constructor is not an `invokevirtual`, so non-final field assignments from the ctor might be reordered. That's at least what the JSR 133 FAQ and Goetz's Concurrency in Practice say, but maybe I'd have to re-read those specs :) Btw, have fun! :) – axel22 Aug 07 '15 at 22:32
@axel22 I'm really late to this party! but thank you very much for weighing in so thoroughly. I'm persuaded that you are right — in theory, I very much hope not in practice! — that the JMM simply doesn't provide strong enough guarantees to ensure correctness without synchronization or volatility when an object is only lazily immutable. (If you distill your explanation into an answer, I'll accept it.) – Steve Waldman Aug 20 '15 at 16:18

Thomas Krieger · Answer 2 · 2015-04-19T14:31:56.833

In your question you are asking for an authoritative statement. I found the following in "Programming in Scala" from Martin Odersky et al: "Third, there is no way for two threads concurrently accessing an immutable to corrupt its state once it has been properbly constructed, because no thread can change the state of an immutable"

If you look for example at the implementation you see that this is followed in the implementation, see below.

There are some fields inside vector which are not final and could lead to data races. But since they are only changed inside a method creating a new instance and since you need an Synchronization action to access the newly created instance in different threads anyway everyting is fine.

The pattern used here is to create and modify an object. Than make it visible to other threads, for example by assigning this instance to a volatile static or static final. And after that make sure that it is not changed anymore.

As an Example the creation of two vectors:

  val vector = Vector(4,5,5)
  val vector2 =  vector.updated(1, 2);

The method updated uses the var field dirty inside:

private[immutable] def updateAt[B >: A](index: Int, elem: B): Vector[B] = {
    val idx = checkRangeConvert(index)
    val s = new Vector[B](startIndex, endIndex, idx)
    s.initFrom(this)
    s.dirty = dirty
    s.gotoPosWritable(focus, idx, focus ^ idx)  // if dirty commit changes; go to new pos and prepare for writing
    s.display0(idx & 0x1f) = elem.asInstanceOf[AnyRef]
    s
  }

but since after creation of vector2 it is assigned to a final variable: Bytecode of variable declaration:

private final scala.collection.immutable.Vector vector2;

Byte code of constructor:

61  invokevirtual scala.collection.immutable.Vector.updated(int, java.lang.Object, scala.collection.generic.CanBuildFrom) : java.lang.Object [52]
64  checkcast scala.collection.immutable.Vector [48]
67  putfield trace.agent.test.scala.TestVector$.vector2 : scala.collection.immutable.Vector [22]

Everything is o.k.

`since after creation of vector2 it is assigned to a final variable` - if this variable is part of a stack (some internal method's variable) - it will be ok even if it's not final. Even with mutable field - nothing can happen with resulting collection (like loosing an elemnt) itself - races will just affect the vector2 itself . — dk14, Apr 18 '15 at 13:11

Must access to scala.collection.immutable.List and Vector be synchronized?

2 Answers2

Linked