4

The following code returns whether a given String s is equal to any other of the hard-coded strings. The method uses the switch-statement to do so:

public class SwitchOnString {
    public static boolean equalsAny(String s) {
        switch (s) {
        case "string 1":
            return true;
        case "string 2":
            return true;
        default:
            return false;
        }
    }
}

According to the Java Virtual Machine Specification (JMS) 3.10 Compiling Switches"

Compilation of switch statements uses the tableswitch and lookupswitch instructions.

Moreover

tableswitch and lookupswitch instructions operate only on int data.

I read chapter 3.10 but didn't find anywhere String mentioned.

The only one sentence which comes indirectly close is:

Other numeric types must be narrowed to type int for use in a switch.

Question 1:
Is String in this context also a numeric type? Or did I missed something?

A javap -c on the class SwitchOnString shows:

Compiled from "SwitchOnString.java"
public class playground.SwitchOnString {
  public playground.SwitchOnString();
   ...

  public static boolean equalsAny(java.lang.String);
    Code:
       0: aload_0
       1: dup
       2: astore_1
       3: invokevirtual #16                 // Method java/lang/String.hashCode:()I
       6: lookupswitch  { // 2
            1117855161: 32
            1117855162: 44
               default: 60
          }
   ...

}

Obviously the hashCode values are used as int-keys of the cases. This probably matches:

The lookupswitch instruction pairs int keys (the values of the case labels) ...

Going on to tableswitch and lookupswitch the JMS says:

The tableswitch instruction is used when the cases of the switch can be efficiently represented as indices into a table of target offsets. (...) Where the cases of the switch are sparse, the table representation of the tableswitch instruction becomes inefficient in terms of space. The lookupswitch instruction may be used instead.

If I get this right then the more the cases become sparse the more probably the lookupswitch will be used.

Question 2:
But looking at the bytecode:
Are two string cases sparse enough to compile switch to lookupswitch? Or will every switch on String be compiled to lookupswitch?

LuCio
  • 5,055
  • 2
  • 18
  • 34
  • 1
    String isn't a numeric type, but their hashCodes are. But there's also an `equals` to disallow identical hashCodes. I'm not a huge fan of the syntactic sugar, the only use case (for me) is when I have to consume strings and do something with the raw data. – Dave Newton Jul 26 '18 at 14:18
  • @DaveNewton Which `equals` do you mean? Looking at `String.equals` I don't see how it could "_disallow identical hashCodes_". Can you explain it more? And do you mean, String `case`s are generally sparse because of potential conflicts of their `hashCode`s? – LuCio Jul 26 '18 at 14:27
  • 1
    Although random Strings are likely to have different hashCodes, some trivial ones have the same hashCode so equals is also needed to check the contents of the String is as expected. – Peter Lawrey Jul 26 '18 at 14:56
  • 1
    @PeterLawrey Thx for the explanation. Now I got it. `equals` is meant to resolve the conflict of hashCodes. – LuCio Jul 26 '18 at 15:04
  • 3
    This feels like an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). A switch statement is *woefully* unequipped to actually do this kind of matching. If you want a set of strings, use `Set` instead. I'd like to answer to this effect but I'm very concerned. – Makoto Jul 26 '18 at 15:10
  • Agreed; if you're just doing a true/false to check for the presence of a string, there are better options. – Dave Newton Jul 26 '18 at 15:18
  • @Makoto I know the XY problem. I used the code just to have a showcase. And I shouldn't have use the term set (though I wrote it by intend in lowercase). I admit it's suggesting I could have many strings. It was just meant to introduce my question on `switch` with `String`. As the bytecode turned out the _lookupswitch_ was already used with two string cases. It seems to be independent of how many strings would be compared. – LuCio Jul 26 '18 at 15:42
  • I've updated the question. I hope it's obvious now that the code is serving as an example to introduce the question. My intention is just to understand why the JVM behaves this way as it seems not to be mentioned in the specification. – LuCio Jul 26 '18 at 17:00

1 Answers1

2

The specification doesn’t tell how to compile switch statements, that’s up to the compiler.

In that regard, the JVMS statement, “Other numeric types must be narrowed to type int for use in a switch” does not say that the Java programming language will do such a conversion nor that String or Enum are numeric types. I.e. long, float and double are numeric types, but there’s no support for using them with switch statements in the Java programming language.

So the language specification says that switch over String is supported, hence, the compilers must find a way to compile them to bytecode. Using an invariant property like the hash code is a common solution, but in principle, other properties like the length or an arbitrary character could be used as well.

As discussed in “Why switch on String compiles into two switches” and “Java 7 String switch decompiled: unexpected instruction”, javac currently generates two switch instructions on the bytecode level when compiling switch over String values (ECJ also generates two instructions, but details may differ).

Then, the compiler has to pick either, a lookupswitch or tableswitch instruction. javac does use tableswitch when the numbers are not sparse, but only if the statement has more than two case labels.

So when I compile the following method:

public static char two(String s) {
    switch(s) {
        case "a": return 'a';
        case "b": return 'b';
    }
    return 0;
}

I get

public static char two(java.lang.String);
Code:
   0: aload_0
   1: astore_1
   2: iconst_m1
   3: istore_2
   4: aload_1
   5: invokevirtual #9                  // Method java/lang/String.hashCode:()I
   8: lookupswitch  { // 2
                97: 36
                98: 50
           default: 61
      }
  36: aload_1
  37: ldc           #10                 // String a
  39: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  42: ifeq          61
  45: iconst_0
  46: istore_2
  47: goto          61
  50: aload_1
  51: ldc           #12                 // String b
  53: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  56: ifeq          61
  59: iconst_1
  60: istore_2
  61: iload_2
  62: lookupswitch  { // 2
                 0: 88
                 1: 91
           default: 94
      }
  88: bipush        97
  90: ireturn
  91: bipush        98
  93: ireturn
  94: iconst_0
  95: ireturn

but when I compile,

public static char three(String s) {
    switch(s) {
        case "a": return 'a';
        case "b": return 'b';
        case "c": return 'c';
    }
    return 0;
}

I get

public static char three(java.lang.String);
Code:
   0: aload_0
   1: astore_1
   2: iconst_m1
   3: istore_2
   4: aload_1
   5: invokevirtual #9                  // Method java/lang/String.hashCode:()I
   8: tableswitch   { // 97 to 99
                97: 36
                98: 50
                99: 64
           default: 75
      }
  36: aload_1
  37: ldc           #10                 // String a
  39: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  42: ifeq          75
  45: iconst_0
  46: istore_2
  47: goto          75
  50: aload_1
  51: ldc           #12                 // String b
  53: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  56: ifeq          75
  59: iconst_1
  60: istore_2
  61: goto          75
  64: aload_1
  65: ldc           #13                 // String c
  67: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  70: ifeq          75
  73: iconst_2
  74: istore_2
  75: iload_2
  76: tableswitch   { // 0 to 2
                 0: 104
                 1: 107
                 2: 110
           default: 113
      }
 104: bipush        97
 106: ireturn
 107: bipush        98
 109: ireturn
 110: bipush        99
 112: ireturn
 113: iconst_0
 114: ireturn

It’s not clear why javac makes this choice. While tableswitch has a higher base footprint (one additional 32 bit word) compared to lookupswitch, it would still be shorter in bytecode, even for the two case labels scenario.

But the consistency of the decision can be shown with the second statement, which will always have the same value range, but compiles to lookupswitch or tableswitch depending only on the number of labels. So when using truly sparse values:

public static char three(String s) {
    switch(s) {
        case "a": return 'a';
        case "b": return 'b';
        case "": return 0;
    }
    return 0;
}

it compiles to

public static char three(java.lang.String);
Code:
   0: aload_0
   1: astore_1
   2: iconst_m1
   3: istore_2
   4: aload_1
   5: invokevirtual #9                  // Method java/lang/String.hashCode:()I
   8: lookupswitch  { // 3
                 0: 72
                97: 44
                98: 58
           default: 83
      }
  44: aload_1
  45: ldc           #10                 // String a
  47: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  50: ifeq          83
  53: iconst_0
  54: istore_2
  55: goto          83
  58: aload_1
  59: ldc           #12                 // String b
  61: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  64: ifeq          83
  67: iconst_1
  68: istore_2
  69: goto          83
  72: aload_1
  73: ldc           #13                 // String
  75: invokevirtual #11                 // Method java/lang/String.equals:(Ljava/lang/Object;)Z
  78: ifeq          83
  81: iconst_2
  82: istore_2
  83: iload_2
  84: tableswitch   { // 0 to 2
                 0: 112
                 1: 115
                 2: 118
           default: 120
      }
 112: bipush        97
 114: ireturn
 115: bipush        98
 117: ireturn
 118: iconst_0
 119: ireturn
 120: iconst_0
 121: ireturn

using lookupswitch for the sparse hash codes, but tableswitch for the second switch.

Holger
  • 285,553
  • 42
  • 434
  • 765
  • Thank you for this in depth analysis. Regarding my first question I understand that `String` isn't a numeric type just like `Enum` is not. There is just no further specification for how to `switch` a `String`. It's up to the compiler. The second question can be answered quick with: No - it depends on the number of labels. Did I get you right? – LuCio Jul 26 '18 at 19:28
  • @LuCio yes, you got it right. To be more precise, the answer to the second question is *in case of `javac`*, it depends on the number of labels. Other compilers may handle it differently. – Holger Jul 27 '18 at 08:39
  • Thx for confirmation and clarification. – LuCio Jul 27 '18 at 08:42