31

I'm practicing lambda expressions in Java. I know local variables need to be final or effectively final, according to the Oracle documentation for Java SE 16, Lambda Body:

Any local variable, formal parameter, or exception parameter used but not declared in a lambda expression must either be final or effectively final (§4.12.4), as specified in §6.5.6.1.
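As a minimal sketch of what that rule means in practice (the class and method names are made up for illustration, they are not from the specification): a local that is assigned exactly once can be captured, while reassigning it anywhere in the method would make the lambda fail to compile.

```java
import java.util.function.Supplier;

class EffectivelyFinalDemo {
    static String greet(String name) {
        // Assigned exactly once, so 'greeting' is effectively final.
        String greeting = "Hello, " + name;

        // Capturing an effectively final local is allowed.
        Supplier<String> supplier = () -> greeting + "!";

        // greeting = "Goodbye"; // Uncommenting this reassignment makes 'greeting'
        //                       // no longer effectively final, and the lambda
        //                       // above is then rejected by the compiler.
        return supplier.get();
    }

    public static void main(String[] args) {
        System.out.println(greet("Java")); // prints "Hello, Java!"
    }
}
```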

It doesn't say why, though. Searching, I found this similar question, Why do variables in lambdas have to be final or effectively final?, where StackOverflow user "snr" responded with the following quote:

Local variables in Java have until now been immune to race conditions and visibility problems because they are accessible only to the thread executing the method in which they are declared. But a lambda can be passed from the thread that created it to a different thread, and that immunity would therefore be lost if the lambda, evaluated by the second thread, were given the ability to mutate local variables.

This is what I understand: a method can only be executed by one thread (let's say thread_1) at a time. This ensures the local variables of that particular method are modified only by thread_1. On the other hand, a lambda can be passed to a different thread (thread_2), so... if thread_1 finishes with the lambda expression and keeps executing the rest of the method it could change the values of the local variables, and, at the same time, thread_2 could be changing the same variables within the lambda expression. Then, that's why this restriction exists (local variables need to be final or effectively final).

Sorry for the long explanation. Am I getting this right?

But the next questions would be:

  • Why isn't this case applicable to instance variables?
  • What could happen if thread_1 changes instance variables at the same time as thread_2 (even if they are not executing a lambda expression)?
  • Are instance variables protected in another way?

I don't have much experience with Java. Sorry if my questions have obvious answers.

Naman
DamianGDO
  • 2
    There are some good explanations with example here - https://www.baeldung.com/java-lambda-effectively-final-local-variables, not sure whether you have read it – aksappy Apr 12 '21 at 20:32
  • 7
    "a method can only be executed by one thread (let's say thread_1) at a time" => nope, however the local variables are "initialised and separate" each time the method is executed. – assylias Apr 12 '21 at 20:32
  • 2
    Local variables are handled differently to fields. A reference to a field is certain given a reference to its containing object. Not so with a local variable when its value changes. – Bohemian Apr 12 '21 at 20:39

3 Answers

44

The issue has nothing to do with thread safety, really. There's a simple, straightforward answer to why instance variables can always be captured: this is always effectively final. That is, there is always one known fixed object at the time of the creation of a lambda accessing an instance variable. Remember that an instance variable named foo is always effectively equivalent to this.foo.

So

class MyClass {
  private int foo;
  public void doThingWithLambda() {
    doThing(() -> { System.out.println(foo); });
  }
}

can have the lambda rewritten as doThing(() -> { System.out.println(this.foo); }) and is therefore equivalent to

class MyClass {
  private int foo;
  public void doThingWithLambda() {
    final MyClass me = this;
    doThing(() -> { System.out.println(me.foo); });
  }
}

...except this is already final and doesn't need to be copied to another local variable (though the lambda will capture the reference).

All of the normal thread-safety caveats apply, of course. If your lambdas get passed to multiple threads and modify variables, then exactly the same things would happen if lambdas weren't used, and no extra thread-safety applies beyond the thread safety of your variables (e.g. if they are volatile) or if your lambdas use other mechanisms to safely access the variables. Lambdas do nothing special about thread-safety at all, and they don't do anything special with instance variables, either; they just capture a reference to this instead of to the instance variable.
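A small sketch of that last point, assuming a hypothetical class FieldCaptureDemo (none of these names come from the question): because the lambda holds a reference to this rather than a copy of foo, it sees whatever value the field has at the time it runs.

```java
import java.util.function.IntSupplier;

class FieldCaptureDemo {
    private int foo = 1;

    IntSupplier reader() {
        // The lambda captures 'this', not a copy of foo,
        // so each call reads the current value of this.foo.
        return () -> foo;
    }

    static int[] readBeforeAndAfter() {
        FieldCaptureDemo demo = new FieldCaptureDemo();
        IntSupplier supplier = demo.reader();
        int before = supplier.getAsInt(); // reads 1
        demo.foo = 42;                    // mutate the field after the lambda was created
        int after = supplier.getAsInt();  // reads 42
        return new int[] { before, after };
    }

    public static void main(String[] args) {
        int[] result = readBeforeAndAfter();
        System.out.println(result[0] + " -> " + result[1]); // prints "1 -> 42"
    }
}
```

Everything here runs on a single thread; if the write and the read happened on different threads, the usual visibility rules for fields would apply.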

Louis Wasserman
  • 2
    This answer is at least partially incorrect. The JLS itself notes the issue of concurrency. And `this` is a keyword representing a value, not a variable that can be effectively final. From the JLS: *"Similar rules on variable use apply in the body of an inner class (§8.1.3). The restriction to effectively final variables prohibits access to **dynamically-changing local variables**, whose capture would likely introduce **concurrency problems**."* – Andy Thomas Apr 13 '21 at 16:07
  • is "`this` is effectively final" a basic assertion of JLS? where is that derived from? – cat Apr 13 '21 at 20:00
  • 4
    @cat: It can be used as a reference, and it never changes. It's not literally a variable, but it behaves like a final variable as far as everything in this post discusses. – Louis Wasserman Apr 13 '21 at 20:05
  • 3
    I think this answer could be better if you first explain that lambdas capture local variables by _copying_ their value -- hence why they must be effectively final; otherwise a caller could observe that their value didn't match. So, since `this` is "effectively effectively final," the lambda can capture it by copying the reference. – Chris Bouchard Apr 14 '21 at 01:36
24

The other answers already provide great context around why this is a limitation in Java. I'd like to offer some background on how other languages deal with this when they don't enforce the requirement that local variables be considered immutable (i.e. final).

The main point suggested is that "heap" values (i.e. fields) are intrinsically accessible from other threads, whereas "stack" values (i.e. local variables) are intrinsically accessible only from within the method that declared the values. This is true. So since fields are stored on the heap, they can be mutated after the method has completed. In contrast, stack values go away as soon as the method finishes.

Java chooses to honor these semantics, so a local variable must never be modified after the method completes. This is a fair design decision. However, some languages do choose to allow mutation to local variables after the method exits. So how can that be?

In C# (the language I'm most familiar with, but other languages such as JavaScript also allow these constructs), when you reference a local variable inside of a lambda, the compiler detects that and behind the scenes generates a whole new class to store the local variable. So instead of the variable being declared on the stack, the compiler instantiates that class to store the value. This behind-the-scenes behavior turns the stack value into a heap value. (You can actually decompile such code and see these compiler-generated classes.)
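To make the idea concrete in Java terms, here is a hand-written sketch of roughly what such a generated class amounts to. This is not what the C# compiler actually emits; the Counter class and the method names are hypothetical stand-ins. The "local" lives in a heap object, and the lambda captures an effectively final reference to that object.

```java
import java.util.function.IntSupplier;

class ClosureClassSketch {
    // Hypothetical stand-in for a compiler-generated closure class:
    // the would-be local variable becomes a field of a heap object.
    static final class Counter {
        int count;
    }

    static IntSupplier makeCounter() {
        Counter box = new Counter(); // 'box' itself is effectively final ...
        return () -> ++box.count;    // ... but the field it refers to can be mutated
    }

    public static void main(String[] args) {
        // makeCounter() has already returned, yet the counter state survives on the heap.
        IntSupplier next = makeCounter();
        next.getAsInt();
        next.getAsInt();
        System.out.println(next.getAsInt()); // prints 3
    }
}
```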

This decision isn't without cost. It's obviously more expensive to instantiate a class just to house, for example, an integer. In Java, you are guaranteed this will never happen. In a language such as C#, it requires careful reasoning to know whether your variable has been "lifted" into that generated class.

So ultimately the rationale becomes one of a design decision. In Java you can't shoot yourself in the foot. In C# they decided that most of the time the performance consequences aren't that big of a deal.

That said, C#'s decision has often been a source of confusion and bugs, particularly when the loop variable of a for loop (the variable i can, and must, be mutated) is captured by a lambda, as described in Eric Lippert's blog post. It was so problematic that they decided to introduce a (rare) breaking change to the compiler for the foreach variant.
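The same pitfall can be reproduced in Java if you opt into shared mutable state by hand. The sketch below (hypothetical names, with a one-element array standing in for a captured loop variable) contrasts lambdas that share one mutable cell with lambdas that each capture their own per-iteration copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

class LoopCaptureDemo {
    // Every lambda shares the same mutable cell, similar to a captured C# for-loop variable.
    static List<Integer> sharedCell() {
        List<IntSupplier> suppliers = new ArrayList<>();
        int[] i = { 0 }; // the array reference is effectively final, its element is not
        for (i[0] = 0; i[0] < 3; i[0]++) {
            suppliers.add(() -> i[0]);
        }
        return evaluate(suppliers); // all lambdas see the final value of the cell
    }

    // Each lambda captures its own effectively final copy.
    static List<Integer> copyPerIteration() {
        List<IntSupplier> suppliers = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            int copy = i;
            suppliers.add(() -> copy);
        }
        return evaluate(suppliers);
    }

    private static List<Integer> evaluate(List<IntSupplier> suppliers) {
        List<Integer> results = new ArrayList<>();
        for (IntSupplier supplier : suppliers) {
            results.add(supplier.getAsInt());
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(sharedCell());       // prints [3, 3, 3]
        System.out.println(copyPerIteration()); // prints [0, 1, 2]
    }
}
```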

On the other hand, I've enjoyed the freedom to mutate local variables inside of a lambda in C#. But neither decision comes without cost.

This answer is definitely not trying to advocate for either decision, but I thought it was worthwhile to elaborate on some of these design choices.

Kirk Woll
  • 3
    Small correction: in C#, a variable is lifted to the closure class even if it is not mutated in the lambda. Suffices if it is just referenced. – ach Apr 13 '21 at 09:09
  • @ach thanks for the correction, right you are! I've made the requisite edits. – Kirk Woll Apr 13 '21 at 12:07
  • 3
    This behavior has more consequences than performance. It implies that suddenly, the programmer is responsible for ensuring the thread safety of local variables. In Java, the local variables are immune to data races, which doesn’t apply when you can turn local variables into shared mutable variables. But you can’t, for example, declare a local variable as `volatile` in Java. That’s not possible, as it was never needed. Since you also can’t synchronize on the instance of the synthetic class, ensuring thread safe local variables suddenly becomes more complicated than ensuring thread safe fields. – Holger Apr 13 '21 at 14:13
  • 3
    @ach: When I was on the C# team -- almost ten years ago now -- we considered doing some optimizations which would distinguish between mutated vs merely read outer variables of a lambda, and capture the latter "by value" rather than capturing the variable. I never ended up making such optimizations; it sounds like the team has not done so in the years since, but I would not be surprised at all if they do so someday. – Eric Lippert Apr 13 '21 at 18:14
  • 1
    Similarly we considered optimizations for generating multiple closure classes; it is fairly common to have two lambdas whose outer variables are disjoint but they get one closure, which means that the lifetimes of all outer variables are extended to the lifetime of the longest-lived delegate, which is unnecessary and surprising. I don't know if that optimization was ever implemented. – Eric Lippert Apr 13 '21 at 18:16
  • 2
    @Holger: Your points are correct and well taken, but it's worthwhile to note that since C# 2.0 it has always been the case that local variables can be modified in unexpected ways and unexpected orders, and you need neither anonymous methods nor lambdas nor even multithreading to fall victim to such races. Coroutines -- iterator blocks in C# 2.0 and async methods in C# 5.0 -- also have the property that they hoist locals to the heap and extend their lifetimes because coroutine activations do not form a stack. – Eric Lippert Apr 13 '21 at 18:19
  • @EricLippert: Even if some variables are shared, a compiler could create a class for each subset of variables that is shared by a different collection of closures, and then have each closure hold references to any such class objects that it actually needs. – supercat Apr 13 '21 at 19:25
  • @supercat That, of course, has a different trade-off: creation of closures and access of a captured variable become even slower. – Joker_vD Apr 13 '21 at 22:22
  • @Joker_vD: Those are trade-offs, but IMHO semantics should have priority over performance. Though I also happen to think the right approach would have been to require that closures expressly indicate what variables they are closing over, and whether the closure is by value, by reference, or compiler's choice. – supercat Apr 13 '21 at 22:45
  • 2
    @supercat: Re: as many closure classes as there are subsets of outer vars, yes, that was exactly the optimization we considered. – Eric Lippert Apr 14 '21 at 02:26
  • 2
    @supercat: Re: indicating in the language: when we were designing lambdas for C# 3.0 Herb Sutter randomly stopped by my office and we had a very entertaining conversation about how C++ was doing exactly that, and what the pros and cons were. Obviously the C# team decided on not adding a syntax for indicating desired closure semantics. In retrospect, I kinda wish that we had made it easier to statically detect and disallow LINQ query comprehensions that closed over modifiable variables, as that turned out to be a rich source of user errors. – Eric Lippert Apr 14 '21 at 02:33
  • 1
    @supercat: (I am still not sure what exactly motivated Herb to stop by that one time; I assume that someone told him that we were pondering this problem when he happened to be in Redmond.) – Eric Lippert Apr 14 '21 at 02:35
  • 2
    @Joker_vD: I note that the same tradeoff exists for the "transparent identifiers" introduced by query comprehension rewriting. They desugar into types where you can end up drilling down through several levels of dereferencing to get to a variable. But if you're building a big query with a lot of SelectMany clauses, odds are pretty good that time spent accessing the range variables is going to be the least of your performance worries. – Eric Lippert Apr 14 '21 at 02:37
  • 2
    @EricLippert yes, for C#, the road has been taken which makes it easier to decide to repeat it for other features. For Java, this does not apply, which makes staying with only sharing (effectively) final local variables attractive. Which means that supercat’s idea of letting the programmer decide exists in Java. Just use a local variable for “by value” semantic or explicitly create the class holding the variable for “by reference” semantic. – Holger Apr 14 '21 at 07:04
  • @EricLippert, I think that a captured variable not being modified inside a lambda does not suffice as a condition that allows copying it. The variable may be modified 1) in the lambda itself, 2) in another lambda that captures it, and 3) in the function itself. Also, modification by itself does not impede such an optimization if we can prove that other "threads" are not observing the variable's value (e.g. `{ int i = ...; i += 1; ... () => { i += 1; }; }`). – ach Apr 14 '21 at 19:35
  • @ach: That's correct, and there are also aliasing concerns when you throw `in`, `out`, `ref` calling conventions into the mix, and of course it is not legal in the CLR to close over an aliased parameter, and so it goes. There's a lot of stuff you've got to keep track of which is why we never got around to making that optimization on my watch. – Eric Lippert Apr 14 '21 at 20:08
  • @Holger In all Java codebases I've seen (all 3 of those!), the "by reference" semantics is usually done by passing a 1-element long array around. Much shorter than creating a dedicated class every time. – Joker_vD Apr 14 '21 at 21:41
  • 1
    @Joker_vD passing an array may be shorter in source code but can be slightly less efficient at runtime. Still, it’s an often used approach. My comment was not meant to be exhaustive regarding the possibilities to denote the by-reference semantics, all it wanted to say, is that the choice exists. – Holger Apr 15 '21 at 08:10
  • 1
    @ach but all of these potentially modifying uses are within the same syntactical unit (the scope of the local variable), so it’s possible to prove the existence or absence. After all, even when Java takes the different route of forbidding mutation, it’s relying on the ability to spot such modifications, to deny them. It has no `out` or `ref` parameters, though, but a compiler should know when a parameter variable is of that kind. So, such a check is possible. Proving that no other thread can see a variable is a different beast. – Holger Apr 15 '21 at 08:14
19

Instance variables are stored in the heap space whereas local variables are stored in the stack space. Each thread maintains its own stack and hence the local variables are not shared across the threads. On the other hand, the heap space is shared by all threads and therefore, multiple threads can modify an instance variable. There are various mechanisms to make the data thread-safe and you can find many related discussions on this platform. Just for the sake of completeness, I've quoted below an excerpt from http://web.mit.edu/6.005/www/fa14/classes/18-thread-safety/

There are basically four ways to make variable access safe in shared-memory concurrency:

  • Confinement. Don’t share the variable between threads. This idea is called confinement, and we’ll explore it today.
  • Immutability. Make the shared data immutable. We’ve talked a lot about immutability already, but there are some additional constraints for concurrent programming that we’ll talk about in this reading.
  • Threadsafe data type. Encapsulate the shared data in an existing threadsafe data type that does the coordination for you. We’ll talk about that today.
  • Synchronization. Use synchronization to keep the threads from accessing the variable at the same time. Synchronization is what you need to build your own threadsafe data type.
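
As one hedged illustration of the "threadsafe data type" option from the list above (the class and method names are assumptions for this sketch, not part of the quoted reading), two threads run the same lambda and update a shared java.util.concurrent.atomic.AtomicInteger instead of a plain int:

```java
import java.util.concurrent.atomic.AtomicInteger;

class SharedCounterDemo {
    static int countWithTwoThreads(int incrementsPerThread) {
        // Threadsafe data type shared by both threads through the captured reference.
        AtomicInteger counter = new AtomicInteger();

        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                counter.incrementAndGet(); // atomic read-modify-write
            }
        };

        Thread first = new Thread(work);
        Thread second = new Thread(work);
        first.start();
        second.start();
        try {
            // join() waits for both threads and makes their updates visible here.
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWithTwoThreads(10_000)); // prints 20000
    }
}
```

With a plain int field and count++ in place of the AtomicInteger, the two threads could lose updates, which is the kind of problem the question asks about for instance variables.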
Arvind Kumar Avinash