Which Locale should I specify when I call String#toLowerCase?

Question

In Java, the String#toLowerCase method uses the default system Locale to determine how to handle lowercasing. If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?

EDIT: I'm mainly concerned about programming identifiers such as table and column names in a schema. As such, I want English lowercasing to apply.

The documentation for Locale.ROOT states that it is the language/country neutral locale for the locale sensitive operations.

Locale.ENGLISH would presumably also be a safe choice.

"some ASCII text": do you really mean ASCII text. Or do you mean "some text"? — Raedwald, Apr 26 '12 at 16:11
I meant ASCII. I was trying to imply that I wasn't using any non ASCII chars. I have clarified on the question. — mchr, Apr 26 '12 at 16:28

score 21 · Accepted Answer · edited Jun 20 '20 at 09:12

Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts: it doesn't involve any special casing rules, and in the ENGLISH locale all 7-bit ASCII characters case-convert to 7-bit ASCII characters.

That is not true for all other locales. In Turkish, the 'I' and 'i' characters are not case-converted to one another.
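A minimal sketch of the difference, for a column name like the ones in the question. The class name `IdentifierCase` and the helper `normalizeIdentifier` are made up for illustration; only `String#toLowerCase(Locale)` and the `Locale` constants come from the JDK.

```java
import java.util.Locale;

class IdentifierCase {
    // Hypothetical helper: lowercases a machine identifier without
    // depending on the JVM's default locale.
    static String normalizeIdentifier(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String column = "USER_ID";

        System.out.println(column.toLowerCase(Locale.ENGLISH)); // user_id
        System.out.println(normalizeIdentifier(column));        // user_id
        // Under Turkish rules 'I' maps to dotless ı (U+0131), not 'i'.
        System.out.println(column.toLowerCase(turkish));        // user_ıd
    }
}
```

The no-argument `toLowerCase()` behaves like the last call whenever the JVM's default locale happens to be Turkish, which is why passing an explicit locale matters for identifiers.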

"Dotted and dotless I" explains:

The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.

Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
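The round-trip effect described in that last paragraph can be sketched in Java roughly as follows. `RoundTripCase` and `upperThenLower` are illustrative names, not part of any library.

```java
import java.util.Locale;

class RoundTripCase {
    // Uppercases then lowercases, using the same locale for both steps.
    static String upperThenLower(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotless = "\u0131"; // ı

        // Non-Turkish rules: ı -> I -> i, so the letter changes.
        System.out.println(upperThenLower(dotless, Locale.ROOT).equals(dotless)); // false
        // Turkish rules: ı -> I -> ı, so the round trip is stable.
        System.out.println(upperThenLower(dotless, turkish).equals(dotless));     // true
    }
}
```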

The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt

# ================================================================================

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

...
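To see how Java applies the `tr` and `az` entries for U+0130 quoted above, here is a small sketch that prints the lowercased result as code points. `DottedCapitalI` and `lowerCodePoints` are hypothetical names chosen for this example.

```java
import java.util.Locale;

class DottedCapitalI {
    // Lowercases s under the given language tag and renders the result
    // as space-separated code points such as "U+0069".
    static String lowerCodePoints(String s, String languageTag) {
        String lower = s.toLowerCase(Locale.forLanguageTag(languageTag));
        StringBuilder out = new StringBuilder();
        lower.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format(Locale.ROOT, "U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String capitalIDot = "\u0130"; // İ
        for (String tag : new String[] {"tr", "az", "en"}) {
            System.out.println(tag + ": " + lowerCodePoints(capitalIDot, tag));
        }
    }
}
```

For `tr` and `az` the result is the single code point U+0069, matching the table rows. For `en` I would expect the unconditional mapping instead, which should yield two code points (U+0069 followed by U+0307), so the output is no longer plain ASCII.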

answered Apr 26 '12 at 15:48

Mike Samuel

"That is not true for all other locales", which can not use ASCII. – Raedwald Apr 26 '12 at 16:08
In what circumstances would you use the ROOT locale? I have been using that to mean that I don't want to have any special case case-folding applied. – mchr Apr 26 '12 at 16:27
@Raedwald, I'm not sure I understand. Could you please expand on your comment? – Mike Samuel Apr 26 '12 at 17:48
@mchr, I believe `Locale.ROOT` is similar to `LANG=C` in shell environments. It specifies that collation is lexicographic, that no special case conditions apply, and that number formatting is close to that produced by `String.valueOf(number)`. – Mike Samuel Apr 26 '12 at 17:55
A DecimalFormat created with Locale.ROOT still uses grouping characters, which differs from `String.valueOf`. Otherwise I think Locale.ROOT might make it more clear that the value is used internally and not for display purposes. – Jörn Horstmann Apr 26 '12 at 21:04
@JörnHorstmann, Good to know. Does it reliably use '.' only to mean integer/fraction separator? – Mike Samuel Apr 26 '12 at 23:40
@MikeSamuel: I think it does, but I'm still searching if there are any differences to `Locale.ENGLISH`. – Jörn Horstmann Apr 27 '12 at 09:33
"That is not true for all other locales", which can not use ASCII: ASCII is an encoding that can not encode non-English Unicode code points. Therefore if some text is described as "ASCII" this means it is in English. – Raedwald Apr 27 '12 at 10:09
@Raedwald, "ASCII" is the name of an encoding, but it also refers to a particular Unicode [code-page](http://unicode.org/charts/PDF/U0000.pdf). Locales for writing systems that do not use Roman characters do case-fold ASCII characters to ASCII characters, since they do not include any special case conversion rules that involve ASCII codepoints. – Mike Samuel Apr 27 '12 at 15:23
@JörnHorstmann did you ever figure out the differences between `Locale.ROOT` and `Locale.ENGLISH`? – mirabilos Aug 15 '21 at 16:21
@mirabilos Are you wondering what `NumberFormat.getNumberInstance(Locale.ROOT)` does? If so, I get `#,##0.###` which, as @jörn-horstmann points out has a digit group separator. – Mike Samuel Aug 17 '21 at 16:50
@MikeSamuel I was more wondering whether I should use ENGLISH or ROOT for things like lowercasing, but also the differences between these two in general. The source code is… on a different abstraction level. – mirabilos Aug 17 '21 at 19:07
@mirabilos I believe case adjustments are the same. As pointed out in the answer, there are no special casing rules for *en*. For case adjustments of human language terms, you should use the user's locale though. For case adjustments of machine identifiers, maybe use [UAX-31](https://unicode.org/reports/tr31/#normalization_and_case). – Mike Samuel Aug 18 '21 at 14:18
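For the number-formatting side of the comments above, here is a rough sketch that compares `NumberFormat.getNumberInstance(Locale.ROOT)` with `String.valueOf`. The class name `RootNumberFormat`, the helper `formatRoot`, and the sample value are assumptions made for illustration.

```java
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Locale;

class RootNumberFormat {
    // Formats a value with the number format for Locale.ROOT.
    static String formatRoot(double value) {
        return NumberFormat.getNumberInstance(Locale.ROOT).format(value);
    }

    public static void main(String[] args) {
        double value = 1234567.891;
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.ROOT);
        if (nf instanceof DecimalFormat) {
            // Prints the underlying pattern so the grouping rule is visible.
            System.out.println(((DecimalFormat) nf).toPattern());
        }
        System.out.println(formatRoot(value));     // digits are grouped
        System.out.println(String.valueOf(value)); // no grouping
    }
}
```

If grouping is unwanted for machine-readable output, calling `setGroupingUsed(false)` on the format is one way to turn it off.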

score 3 · Answer 2 · answered Apr 26 '12 at 15:52

If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?

That depends on what "as expected" means for you. The point of letting you specify a Locale is that uppercasing and lowercasing do not work the same way in all languages, even when those languages use the same letters. So specify the Locale that you and/or your customers live in, and it will probably work as you or they expect.

Michael Borgwardt

The OP says "some ASCII text". As ASCII is useful for only English text, "as expected" must mean as expected in English. – Raedwald Apr 26 '12 at 16:10
@Raedwald, why English? Latin letters are used in several different languages, not only European ones. If you use diacritic letters from ASCII, the scope of languages is wider. – CoolMind Feb 02 '21 at 09:08
@CoolMind ASCII does not have *any* diacritic letters. You are probably confusing ASCII with one of the several 8-bit character sets that extend ASCII to provide diacritics and additional European letters. – Raedwald Feb 02 '21 at 11:13
@Raedwald, sorry, agree with you. Most Latin-based languages have more than 26 letters. – CoolMind Feb 02 '21 at 11:23
