Clarification on how character encodings work

Question

I am writing a program to get the "sum" of a word, based on letters (i.e. "abc" = a+b+c = 1+2+3 = 6). I am using the method of total += (int) char - 'a' + 1 (in Java). The program is to be case insensitive ('A' = 'a'), so first I want to convert the char to lowercase if necessary. I have written

if (char < 'a') {char += 32;}

which is correct in UTF-16 and ASCII, but not UTF-8.

My question is, if I were to ship this code, how does encoding work past compiling? If the user is using UTF-8, will the program fail (so it's better to use Character.toLowerCase()), or since the program is in Java, any characters in the program will be the program's encoding, hence it works?

In case it isn't clear, I have no idea what I'm talking about, so some general info about how the encoding works would be great too.

Java uses UTF-16 by default, but you can set the string's charset in the constructor. You can use String.toLowerCase to make it case-insensitive. Here is a table of standard ASCII character values: http://www.asciitable.com/ This is basically what you want to use. — , Dec 20 '20 at 18:09
You can't set the charset **of the String**, it is UTF-16. You can state the charset of the byte array you're using to initialize the String; this tells the runtime how to convert into UTF-16. — a guest, Dec 20 '20 at 18:25
Why not use an existing [hash function](https://en.m.wikipedia.org/wiki/Hash_function) rather than invent your own? [`String::hashCode`](https://en.m.wikipedia.org/wiki/Hash_function) returns an `int`. — Basil Bourque, Dec 20 '20 at 18:29
@BasilBourque - that's not what he's trying to do. He seems to want to replace specific letters by specific values. — a guest, Dec 20 '20 at 18:31

score 1 · Accepted Answer · answered Dec 20 '20 at 18:35

A Java String is always encoded in UTF-16; input and output are converted as necessary.
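As a rough illustration of "converted as necessary", the sketch below decodes the same text from two different byte encodings. The class name `DecodeDemo` and the helper `decode` are hypothetical names chosen for this example; only `String.getBytes(Charset)`, `new String(byte[], Charset)` and `StandardCharsets` come from the standard library. Once decoded, both strings hold the same `char` values, so comparisons against `'a'` or `'A'` behave the same regardless of how the input bytes were encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Hypothetical helper: decode raw bytes using an explicitly named charset.
    // The charset describes the bytes, not the resulting String.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The same text as it might arrive from two differently encoded sources.
        byte[] utf8Bytes = "Abc".getBytes(StandardCharsets.UTF_8);
        byte[] utf16Bytes = "Abc".getBytes(StandardCharsets.UTF_16BE);

        String fromUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        String fromUtf16 = decode(utf16Bytes, StandardCharsets.UTF_16BE);

        // The byte arrays differ in length, but the decoded Strings are equal.
        System.out.println(utf8Bytes.length + " vs " + utf16Bytes.length);
        System.out.println(fromUtf8.equals(fromUtf16));
        System.out.println((int) fromUtf8.charAt(0)); // numeric value of 'A'
    }
}
```

The point of the design is that the encoding question is settled at the boundary (reading bytes in, writing bytes out); code that works on `char` values inside the program does not need to know which encoding the user's files or terminal use.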

This, however, can be better written:

 if (char < 'a') {char += 32;}

as

 if (ch >= 'A' && ch <= 'Z')
    ch += ('a' - 'A');

Reasons:

- Checking for the expected range is just more cautious.
- You do not need to 'know' that the distance between lower-case alphabetics and upper-case alphabetics is 32.

Also, `char` is a keyword in Java, so it cannot be used as a variable name; the rewritten snippet uses `ch` instead.

This, of course, only works for the 26 unaccented letters A-Z of the basic English alphabet.

However, I would suggest you use (as you yourself stated) `Character.toLowerCase()`, since that's what it's there for: to relieve you of these details.
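Putting the pieces together, a minimal sketch of the word sum from the question might look like the following. The class name `WordSum` and method name `wordSum` are made up for this example, and skipping every character outside a-z is an assumption, since the question does not say how such input should be handled.

```java
class WordSum {
    // Sums letter positions (a=1 ... z=26), treating upper and lower case alike.
    // Assumption: any character that is not a-z after lowercasing is skipped.
    static int wordSum(String word) {
        int total = 0;
        for (int i = 0; i < word.length(); i++) {
            char ch = Character.toLowerCase(word.charAt(i));
            if (ch >= 'a' && ch <= 'z') {
                total += ch - 'a' + 1;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(wordSum("abc")); // 6
        System.out.println(wordSum("ABC")); // 6
    }
}
```

The range check after lowercasing keeps digits, punctuation and accented letters from contributing negative or out-of-range values to the total.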

answered Dec 20 '20 at 18:35

a guest


I do not actually use `char` as the name, it's just a placeholder for my complicated variable name – Aharon K Dec 20 '20 at 19:09
Changing to lower case by adding a fixed offset works for A-Z, but not generally. If you want to support letters beyond A-Z, you need to use a case mapping function. – Peter Constable Dec 22 '20 at 16:11
And then you have to worry about letters that have more than one lower-case representation: Σ maps to either σ or ς. – a guest Dec 22 '20 at 23:39
While there were no API changes, since Java 9 it has not been true that _"A Java String is always encoded in UTF-16"_. See JEP 254 (Compact Strings) and [this SO answer](https://stackoverflow.com/a/9699138/2985643) to the question _"What is the Java's internal represention for String? Modified UTF-8? UTF-16?"_. – skomisa Mar 22 '21 at 19:52
