32

In Java, the String#toLowerCase method uses the default system Locale to determine how to handle lowercasing. If I am lowercasing some ASCII text and want to be sure that it is processed as expected, which Locale should I use?

EDIT: I'm mainly concerned about programming identifiers such as table and column names in a schema. As such I want English lower casing to apply.

The documentation for Locale.ROOT states that it is the language/country neutral locale for the locale sensitive operations.

Locale.ENGLISH would presumably also be a safe choice.

Lii
mchr
  • "some ASCII text": do you really mean ASCII text. Or do you mean "some text"? – Raedwald Apr 26 '12 at 16:11
  • I meant ASCII. I was trying to imply that I wasn't using any non ASCII chars. I have clarified on the question. – mchr Apr 26 '12 at 16:28

2 Answers

21

Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts, since it doesn't involve any special casing rules and, in the ENGLISH locale, all 7-bit ASCII characters case-convert to 7-bit ASCII characters.

That is not true of every other locale. In Turkish, for example, the 'I' and 'i' characters are not case-converted to one another.
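As a minimal sketch of the difference, the following compares lowercasing the same ASCII identifier under Locale.ENGLISH, Locale.ROOT, and a Turkish locale. The class name `AsciiLowercaseDemo` and the helper `lowerIdentifier` are hypothetical names chosen for illustration; only `String#toLowerCase(Locale)` and `Locale.forLanguageTag` come from the JDK.

```java
import java.util.Locale;

final class AsciiLowercaseDemo {
    // Hypothetical helper: lowercases a schema identifier without
    // consulting the JVM's default locale.
    static String lowerIdentifier(String identifier) {
        return identifier.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String column = "TITLE";

        System.out.println(lowerIdentifier(column));          // title
        System.out.println(column.toLowerCase(Locale.ROOT));  // title
        // Under Turkish rules 'I' lowercases to dotless ı (U+0131).
        System.out.println(column.toLowerCase(turkish));      // t\u0131tle
    }
}
```

Calling the no-argument `toLowerCase()` on a machine whose default locale is Turkish would behave like the last line, which is why passing an explicit locale matters for identifiers.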

"Dotted and dotless I" explains:

The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.

Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
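The round-trip effect described in the quote can be sketched in Java as follows. This is only an illustration; `CaseRoundTripDemo` and `upperThenLower` are hypothetical names, and Locale.ROOT stands in for software that is not set up for Turkish.

```java
import java.util.Locale;

final class CaseRoundTripDemo {
    // Uppercases and then lowercases the text under a single locale.
    static String upperThenLower(String text, Locale locale) {
        return text.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        String dotless = "\u0131"; // ı, LATIN SMALL LETTER DOTLESS I

        // Without Turkish rules: ı -> I -> i, so the letter changes.
        System.out.println(upperThenLower(dotless, Locale.ROOT).equals(dotless)); // false
        // With Turkish rules: ı -> I -> ı, so the letter survives.
        System.out.println(upperThenLower(dotless, turkish).equals(dotless));     // true
        // With Turkish rules: i -> İ -> i.
        System.out.println(upperThenLower("i", turkish).equals("i"));             // true
    }
}
```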

The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt

# ================================================================================

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

...

Mike Samuel
  • "That is not true for all other locales", which can not use ASCII. – Raedwald Apr 26 '12 at 16:08
  • In what circumstances would you use the ROOT locale? I have been using that to mean that I don't want to have any special case case-folding applied. – mchr Apr 26 '12 at 16:27
  • @Raedwald, I'm not sure I understand. Could you please expand on your comment? – Mike Samuel Apr 26 '12 at 17:48
  • @mchr, I believe `Locale.ROOT` is similar to `LANG=C` in shell environments. It specifies that collation is lexicographic, that no special case conditions apply, and that number formatting is close to that produced by `String.valueOf(number)`. – Mike Samuel Apr 26 '12 at 17:55
  • A DecimalFormat created with Locale.ROOT still uses grouping characters, which differs from `String.valueOf`. Otherwise I think Locale.ROOT might make it more clear that the value is used internally and not for display purposes. – Jörn Horstmann Apr 26 '12 at 21:04
  • @JörnHorstmann, Good to know. Does it reliably use '.' only to mean integer/fraction separator? – Mike Samuel Apr 26 '12 at 23:40
  • @MikeSamuel: I think it does, but I'm still searching if there are any differences to `Locale.ENGLISH`. – Jörn Horstmann Apr 27 '12 at 09:33
  • "That is not true for all other locales", which can not use ASCII: ASCII is an encoding that can not encode non-English Unicode code points. Therefore if some text is described as "ASCII" this means it is in English. – Raedwald Apr 27 '12 at 10:09
  • @Raedwald, "ASCII" is the name of an encoding, but it also refers to a particular unicode [code-page](http://unicode.org/charts/PDF/U0000.pdf). Locales for writing systems that do not use Roman characters do case-fold ASCII characters to ASCII characters since they do not include any special case conversion rules that involve ASCII codepoints. – Mike Samuel Apr 27 '12 at 15:23
  • @JörnHorstmann did you ever figure out the differences between `Locale.ROOT` and `Locale.ENGLISH`? – mirabilos Aug 15 '21 at 16:21
  • @mirabilos Are you wondering what `NumberFormat.getNumberInstance(Locale.ROOT)` does? If so, I get `#,##0.###` which, as @jörn-horstmann points out has a digit group separator. – Mike Samuel Aug 17 '21 at 16:50
  • @MikeSamuel I was more wondering whether I should use ENGLISH or ROOT for things like lowercasing, but also the differences between these two in general. The source code is… on a different abstraction level. – mirabilos Aug 17 '21 at 19:07
  • @mirabilos I believe case adjustments are the same. As pointed out in the answer, there are no special casing rules for *en*. For case adjustments of human language terms, you should use the user's locale though. For case adjustments of machine identifiers, maybe use [UAX-31](https://unicode.org/reports/tr31/#normalization_and_case). – Mike Samuel Aug 18 '21 at 14:18
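The point raised in the comments above, that a number format obtained for Locale.ROOT still groups digits while `String.valueOf` does not, can be checked with a short sketch. `RootNumberFormatDemo` and `formatWithRoot` are hypothetical names; the exact separator characters printed depend on the locale data shipped with the JDK, so no output is asserted for the formatted value.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class RootNumberFormatDemo {
    // Formats a value with the number format the JDK provides for Locale.ROOT.
    static String formatWithRoot(long value) {
        return NumberFormat.getNumberInstance(Locale.ROOT).format(value);
    }

    public static void main(String[] args) {
        long value = 1234567L;

        // Plain conversion: digits only.
        System.out.println(String.valueOf(value)); // 1234567
        // ROOT number format: expected to include digit grouping.
        System.out.println(formatWithRoot(value));
        // Reports whether the ROOT format groups digits.
        System.out.println(NumberFormat.getNumberInstance(Locale.ROOT).isGroupingUsed());
    }
}
```

For lowercasing ASCII identifiers, none of this matters; it only illustrates that ROOT and `String.valueOf` are not interchangeable for number formatting.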
3

If I am lowercasing some ASCII text and want to be sure that this is processed as expected which Locale should I use?

That depends on what "as expected" means for you. The point of being able to specify a Locale is that uppercasing/lowercasing does not work the same way in all languages, even though they may use the same letters. So specify the Locale that you and/or your customers live in, and it will probably work as you/they expect.

Michael Borgwardt
  • The OP says "some ASCII text". As ASCII is useful for only English text, "as expected" must mean as expected in English. – Raedwald Apr 26 '12 at 16:10
  • @Raedwald, why English? Latin letters are used in several different languages, not only European ones. If you use diacritic letters from ASCII, the scope of languages is wider. – CoolMind Feb 02 '21 at 09:08
  • @CoolMind ASCII does not have *any* diacritic letters. You are probably confusing ASCII with one of the several 8-bit character sets that extend ASCII to provide diacritics and additional European letters. – Raedwald Feb 02 '21 at 11:13
  • @Raedwald, sorry, agree with you. Most Latin-based languages have more than 26 letters. – CoolMind Feb 02 '21 at 11:23