4

This is a question about performance. I can convert from uppercase to lowercase and vice versa by using this code:

From lowercase to uppercase:

// Uppercase letters. 
class UpperCase {  
  public static void main(String args[]) { 
    char ch;
    for (int i = 0; i < 10; i++) { 
      ch = (char) ('a' + i);
      System.out.print(ch); 

      // This statement turns off the 6th bit.   
      ch = (char) ((int) ch & 65503); // ch is now uppercase
      System.out.print(ch + " ");  
    } 
  } 
}

From uppercase to lowercase:

// Lowercase letters. 
class LowerCase {  
  public static void main(String args[]) { 
    char ch;
    for (int i = 0; i < 10; i++) { 
      ch = (char) ('A' + i);
      System.out.print(ch);
      ch = (char) ((int) ch | 32); // ch is now lowercase
      System.out.print(ch + " ");  
    } 
  } 
}

I know that Java provides the following methods: .toUpperCase( ) and .toLowerCase( ). Thinking about performance, what is the fastest way to do this conversion, by using bitwise operations the way I showed it in the code above, or by using the .toUpperCase( ) and .toLowerCase( ) methods? Thank you.

Edit 1: Notice how I am using decimal 65503, which is binary ‭1111111111011111‬. I am using 16 bits, not 8. According to the answer currently with more votes at How many bits or bytes are there in a character?:

A Unicode character in UTF-16 encoding is between 16 (2 bytes) and 32 bits (4 bytes), though most of the common characters take 16 bits. This is the encoding used by Windows internally.

The code in my question is assuming UTF-16.

alexrnov
  • 2,346
  • 3
  • 18
  • 34
Jaime Montoya
  • 6,915
  • 14
  • 67
  • 103
  • You can run some tests to find out. Or read the documentation to see how `.toUpperCase()` or `.toLowerCase()` is implemented. Remove the print statements from your code before you run tests. And make sure you use a sufficiently large sample size. And read this before you do any testing. https://stackoverflow.com/questions/447739/java-performance-testing. – clinomaniac Mar 08 '18 at 21:05
  • 2
    Why care for something as insignificant as the performance of case conversion? – Kayaman Mar 08 '18 at 21:06
  • Possible duplicate of [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – Bernhard Barker Mar 08 '18 at 21:08
  • @Kayaman Because I am working in code optimization and I need to justify why I am writing code the way I am doing it, and how the way I am writing the code is good in terms of performance. – Jaime Montoya Mar 08 '18 at 21:09
  • 1
    I'll benchmark this with JMH in a couple hours if it hasn't already been by then. I suspect that the bitwise operators will perform better, as `Character#toLowerCase` and `Character#toUpperCase` both perform numerous checks on the argument. – Jacob G. Mar 08 '18 at 21:10
  • 1
    whatever you have in place *should* be faster than `toUpperCase()/toLowerCase()` – Eugene Mar 08 '18 at 21:12
  • 1
    @JaimeMontoya but you can't justify micro-optimization. You're wasting time doing something that doesn't matter. Or did you actually identify case conversions as performance hotspots by profiling your software? – Kayaman Mar 08 '18 at 21:12
  • 1
    @JacobG. I was suspecting that bitwise operators were faster, but I am trying to know for sure. – Jaime Montoya Mar 08 '18 at 21:14
  • 1
    @Kayaman No, I did not identify performance hotspots. But because I have not identified performance hotspots yet and I do not want to be guessing, the safest approach would be to write code with performance in mind. It does not hurt to use the fastest version, even if the performance improvement is insignificant. – Jaime Montoya Mar 08 '18 at 21:17
  • 1
    @JaimeMontoya that's a common beginner mistake. You've got guaranteed working code provided by the JDK that you're trying to optimize without knowing anything about any possible performance advantages. A professional would find places where it makes sense to program for performance. I do understand your way of thinking though, I thought the same when I was beginning programming in my teens. Since you have a limited amount of time though, you can't rewrite everything because you might get a 0.001% performance advantage from it. – Kayaman Mar 08 '18 at 21:23
  • @Kayaman this comment is better than the answer. so true... – Eugene Mar 08 '18 at 21:24
  • @Kayaman I appreciate your feedback. You are totally right, the best I can do is to research this by myself by using sophisticated tools and prove something with data, with numbers, testing for performance. However, I started by asking the question here in case someone could provide a solid answer before doing the sophisticated research myself with performance metrics and tools. Nonetheless, I think this conversation has moved to another direction. Why I want to know it is not the fundamental point. I have a question and it is secondary to reveal why, where, and how I will use this. – Jaime Montoya Mar 08 '18 at 21:29
  • 1
    I will comment this same thing on every single question about micro optimization. You can rewrite every single method you want instead of using what Java provides out of the box, but I will not let you think you're doing something smart. But like I said, I understand the train of thought. It just doesn't really work that way, and eventually you'll realize how right I (and any other experienced developer) was. – Kayaman Mar 08 '18 at 21:35
  • @Kayaman I totally agree with you. But even if it is only for research purposes, I have asked a valid question and I have not found a solid answer yet. – Jaime Montoya Mar 08 '18 at 21:40
  • Then you should learn how to use [JMH](http://openjdk.java.net/projects/code-tools/jmh/) instead of waiting for JacobG to do your performance tests for you. After all, you were supposed to be the performance programmer so you should definitely know how to do profiling and microbenchmarks. – Kayaman Mar 08 '18 at 21:44
  • @Kayaman Good point. JMH is a new tool in my list of things to learn. I honestly thought that my question was going to be easy to answer quickly by the community. There are basic concepts such as "Multiplication is faster than division" as Philip Couling said at https://stackoverflow.com/questions/17883240/is-multiplication-faster-than-float-division. But you have focused on calling me a beginner and criticizing my question rather than providing an answer. Do you know the answer? – Jaime Montoya Mar 08 '18 at 21:51
  • You mean the answer to "which is faster, `.toUpperCase()` or bitwise operations"? Of course I know the answer. It's faster to use bitwise operations because you'd only be supporting basic ASCII instead of Unicode like Java does. It'll be faster because you'll be doing less. Not much faster of course, but still. It's important to criticize questions, because a lot of people don't actually understand what they're asking, and they're surprised that they don't get a simple answer, because they're thinking in simple ways. – Kayaman Mar 08 '18 at 21:56
  • 1
    Letter case transformations are a locale-specific operation. Which locale do you want to be correct in? And do you want it to be correct only for certain codepoints? (Of course, the standard libraries are correct for all.) – Tom Blodget Mar 08 '18 at 21:59
  • @Kayaman Cool. That was all I needed to know. So please write it as an answer so that I can choose it as the best answer. I also learned that performance-wise, the benefit would be insignificant, and that I need to learn JMH to test this myself and have real metrics/numbers for this performance comparison. – Jaime Montoya Mar 08 '18 at 21:59
  • 1
    @JacobG. Thank you for offering to benchmark this with JMH. I should start learning JMH or equivalent tools myself. It will be great to see exactly by how much performance changes between the versions of the code that I mentioned in my question. – Jaime Montoya Mar 08 '18 at 22:20

4 Answers4

7

Yes a method written by you will be slightly faster if you choose to perform the case conversion with a simple bitwise operation, whereas Java's methods have more complex logic to support unicode characters and not just the ASCII charset.

If you look at String.toLowerCase() you'll notice that there's a lot of logic in there, so if you were working with software that needed to process huge amounts of ASCII only, and nothing else, you might actually see some benefit from using a more direct approach.

But unless you are writing a program that spends most of its time converting ASCII, you won't be able to notice any difference even with a profiler (and if you are writing that kind of a program...you should look for another job).

Kayaman
  • 72,141
  • 5
  • 83
  • 121
  • I'm running a really sensitive application and the bitwise operations gave me a measurable performance boost vs String.toLowerCase – Coder Dec 29 '20 at 16:44
  • 1
    @JohnD can you run the JMH benchmark in Jacob G's answer and see if the results are similar? If the performance boost is due to environmental differences (JDK etc.), they might be interesting to see, so you could post them as another answer. – Kayaman Jan 05 '21 at 13:55
  • TBH I was doing a programming competition on leetcode and the bitwise ops with 32 took 7 ms fewer than String.toLowerCase. Leetcode has leaderboards so I assume they account for it, but in either case this one change got me from only beating 46% of solutions for the question to beating over 98% of submissions – Coder Jan 05 '21 at 14:25
5

Your code only works for ANSII characters. What about languages where no clear conversion between lowercase and uppercase exists e.g. German ß (please correct me if I'm wrong my German is horrible) or when a letter/symbol is written using multi-byte UTF-8 code point. Correctness comes before performance and the problem is not so simple if you have to handle UTF-8, as evident in String.toLowerCase(Locale) method.

Karol Dowbecki
  • 43,645
  • 9
  • 78
  • 111
  • 2
    The argument is `String` and the iteration is over its contained `char`s so the encoding is UTF-16, never UTF-8. But you are right that bit flipping isn't going to work for many of Unicode's 50442 letters. – Tom Blodget Mar 08 '18 at 22:37
  • In the code I used in my question, the encoding is UTF-16 to support Unicode. Notice how I am using in the code the value 65503 decimal, which is this in binary: ‭1111111111011111‬. That is 16 bits, not 8, meaning UTF-16 for Unicode. – Jaime Montoya Mar 13 '18 at 15:03
5

As promised, here are two JMH benchmarks; one comparing Character#toUpperCase to your bitwise method, and the other comparing Character#toLowerCase to your other bitwise method. Note that only characters within the English alphabet were tested.

First Benchmark (to uppercase):

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(3)
public class Test {

    @Param({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
            "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"})
    public char c;

    @Benchmark
    public char toUpperCaseNormal() {
        return Character.toUpperCase(c);
    }

    @Benchmark
    public char toUpperCaseBitwise() {
        return (char) (c & 65503);
    }
}

Output:

Benchmark                (c)  Mode  Cnt  Score   Error  Units
Test.toUpperCaseNormal     a  avgt   30  2.447 ± 0.028  ns/op
Test.toUpperCaseNormal     b  avgt   30  2.438 ± 0.035  ns/op
Test.toUpperCaseNormal     c  avgt   30  2.506 ± 0.083  ns/op
Test.toUpperCaseNormal     d  avgt   30  2.411 ± 0.010  ns/op
Test.toUpperCaseNormal     e  avgt   30  2.417 ± 0.010  ns/op
Test.toUpperCaseNormal     f  avgt   30  2.412 ± 0.005  ns/op
Test.toUpperCaseNormal     g  avgt   30  2.410 ± 0.004  ns/op

Test.toUpperCaseBitwise    a  avgt   30  1.758 ± 0.007  ns/op
Test.toUpperCaseBitwise    b  avgt   30  1.789 ± 0.032  ns/op
Test.toUpperCaseBitwise    c  avgt   30  1.763 ± 0.005  ns/op
Test.toUpperCaseBitwise    d  avgt   30  1.763 ± 0.012  ns/op
Test.toUpperCaseBitwise    e  avgt   30  1.757 ± 0.003  ns/op
Test.toUpperCaseBitwise    f  avgt   30  1.755 ± 0.003  ns/op
Test.toUpperCaseBitwise    g  avgt   30  1.759 ± 0.003  ns/op

Second Benchmark (to lowercase):

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Fork(3)
public class Test {

    @Param({"A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
            "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z"})
    public char c;

    @Benchmark
    public char toLowerCaseNormal() {
        return Character.toUpperCase(c);
    }

    @Benchmark
    public char toLowerCaseBitwise() {
        return (char) (c | 32);
    }
}

Output:

Benchmark                (c)  Mode  Cnt  Score   Error  Units
Test.toLowerCaseNormal     A  avgt   30  2.084 ± 0.007  ns/op
Test.toLowerCaseNormal     B  avgt   30  2.079 ± 0.006  ns/op
Test.toLowerCaseNormal     C  avgt   30  2.081 ± 0.005  ns/op
Test.toLowerCaseNormal     D  avgt   30  2.083 ± 0.010  ns/op
Test.toLowerCaseNormal     E  avgt   30  2.080 ± 0.005  ns/op
Test.toLowerCaseNormal     F  avgt   30  2.091 ± 0.020  ns/op
Test.toLowerCaseNormal     G  avgt   30  2.116 ± 0.061  ns/op

Test.toLowerCaseBitwise    A  avgt   30  1.708 ± 0.006  ns/op
Test.toLowerCaseBitwise    B  avgt   30  1.705 ± 0.018  ns/op
Test.toLowerCaseBitwise    C  avgt   30  1.721 ± 0.022  ns/op
Test.toLowerCaseBitwise    D  avgt   30  1.718 ± 0.010  ns/op
Test.toLowerCaseBitwise    E  avgt   30  1.706 ± 0.009  ns/op
Test.toLowerCaseBitwise    F  avgt   30  1.704 ± 0.004  ns/op
Test.toLowerCaseBitwise    G  avgt   30  1.711 ± 0.007  ns/op

I've only included a few different letters (even though all were tested), as they are all share similar outputs.

It's clear that your bitwise methods are faster, mainly due to Character#toUpperCase and Character#toLowerCase performing logical checks (as I had mentioned earlier today in my comment).

Jacob G.
  • 28,856
  • 5
  • 62
  • 116
  • IMO the bitwise version should make *some* kind of check to see that the character actually needs to be processed (I'm talking about something like `if(c >= 'a' && c <='z')`) to make the comparison more similar and realistic. Although I'm surprised that the bitwise version isn't faster than that. – Kayaman Mar 09 '18 at 06:48
4

Just stick to the provided methods .toLowerCase() and .toUpperCase(). Adding two separate classes to perform two methods that have already been provided by java.lang is an overkill and will slow down your program(with a small margin).

MG_Bautista
  • 2,593
  • 2
  • 18
  • 33
S.Tushinov
  • 548
  • 5
  • 13