According to Wikipedia, in 2017 an uppercase ẞ (Unicode U+1E9E) was officially adopted--at least as an option--for what may in fact be a subset of fully-capitalized words in German:

In June of that year, the Council for German Orthography officially adopted a rule that ⟨ẞ⟩ would be an option for capitalizing ⟨ß⟩ besides the previous capitalization as ⟨SS⟩ (i.e., variants STRASSE and STRAẞE would be accepted as equally valid).

It seems like this addition to the German language would greatly simplify case-comparisons between strings (so-called "case-folding" or "fold-case" comparisons). Note, I started this inquiry trying to understand Raku's (a.k.a. Perl6's) implementation, but the question in fact seems to generalize to other programming languages. Here is Raku's default implementation--starting with 13 words from rfdr_Regeln_2017.pdf that have been lowercased (via Raku's .lc function):

~$ cat TO_ẞ_OR_NOT_TO_ẞ.txt
maß straße grieß spieß groß grüßen außen außer draußen strauß beißen fleiß heißen
~$ raku -ne '.words>>.match(/^ <:Ll>+ $/).say;' TO_ẞ_OR_NOT_TO_ẞ.txt
(「maß」 「straße」 「grieß」 「spieß」 「groß」 「grüßen」 「außen」 「außer」 「draußen」 「strauß」 「beißen」 「fleiß」 「heißen」)
~$ raku -ne '.uc.say;' TO_ẞ_OR_NOT_TO_ẞ.txt
MASS STRASSE GRIESS SPIESS GROSS GRÜSSEN AUSSEN AUSSER DRAUSSEN STRAUSS BEISSEN FLEISS HEISSEN
~$ raku -ne '.fc.say;' TO_ẞ_OR_NOT_TO_ẞ.txt
mass strasse griess spiess gross grüssen aussen ausser draussen strauss beissen fleiss heissen

I'm surprised that Raku's fc fold-case implementation essentially converts ß to lowercase ss. It's no surprise, then, that comparing the upper/lower "round-tripped" words with the lowercased originals via eq string equality returns False for every word:

~$ raku -ne 'for .words {print $_.uc.lc eq $_.lc }; "".put;'  TO_ẞ_OR_NOT_TO_ẞ.txt
FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

Fold-cased (.fc) words match, but they do so on the basis of ss characters, not ß:

~$ raku -ne 'for .words {print $_.uc.lc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ.txt
TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
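
Since the question generalizes beyond Raku, here is a minimal comparison sketch in Python (Python is an assumption of this sketch, not part of the original session; the `WORDS` list and the `round_trip` helper are hypothetical names). It mirrors the two one-liners above using `str.upper`, `str.lower`, and `str.casefold`:

```python
# Hypothetical re-creation of the 13-word list used in the Raku session.
WORDS = ("maß straße grieß spieß groß grüßen außen außer "
         "draußen strauß beißen fleiß heißen").split()


def round_trip(word: str) -> str:
    """Uppercase, then lowercase again (the analogue of Raku's .uc.lc)."""
    return word.upper().lower()


# Round-tripped words are compared with the plain lowercase originals ...
matches_lower = [round_trip(w) == w.lower() for w in WORDS]
# ... and with the case-folded originals.
matches_fold = [round_trip(w) == w.casefold() for w in WORDS]

print(matches_lower)
print(matches_fold)
```

As in Raku, the first comparison fails for every word because uppercasing turns ß into SS, while the second succeeds because case folding also maps ß to ss.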

Starting from a capital-ẞ, taking just one capitalized/uppercase word again demonstrates the dichotomy:

~$ echo "straße STRASSE STRAẞE" | raku -ne ' .put for .words;'
straße
STRASSE
STRAẞE
~$ echo "straße STRASSE STRAẞE" | raku -ne ' .lc.say for .words;'
straße
strasse
straße
~$ echo "straße STRASSE STRAẞE" | raku -ne ' for .words { say $_.lc eq "straße" };'
True
False
True
~$ echo "straße STRASSE STRAẞE" | raku -ne ' for .words { say $_.lc eq $_.fc };'
False
True
False

Have any programming languages instituted a foldcase conversion between lowercase ß <--> uppercase ẞ, by default? What programming languages have added lowercase ß <--> uppercase ẞ conversion, as an option (or via a library)? Many Questions/Answers on StackOverflow pre-date the 2017 decision, so I'm looking for up-to-date answers.

[ADDENDUM: I note via this FAQ that the Unicode Consortium's rules appear to be at odds with the 2017 decision of the Council for German Orthography].

Elizabeth Mattijsen
jubilatious1
  • cf [a reddit comment I wrote about a year ago about that `ẞ` decision as it relates to PLs, their implementations, and MoarVM](https://www.reddit.com/r/ProgrammingLanguages/comments/qddpib/comment/hjlofnj/). – raiph Apr 30 '23 at 04:24
  • "List programming languages that have feature X" is not a legitimate SO question. – n. m. could be an AI Apr 30 '23 at 05:05
  • @n.m. Reference? Also, what's preventing you from explaining the Raku implementation as detailed above? – jubilatious1 Apr 30 '23 at 05:11
  • @raiph Since you're a Raku expert maybe you could explain if Raku's implementation of case-folding differs from languages such as Java, C++, C#, and Perl? And maybe incorporating data from here: https://unicode.org/Public/UNIDATA/CaseFolding.txt and policy from here: https://unicode.org/faq/casemap_charprop.html#13 (and subsequent) would be nice, too! – jubilatious1 Apr 30 '23 at 05:14
  • https://stackoverflow.com/help/dont-ask There is no actual problem to be solved, for one. Not sure what exactly to explain. `straße` must casefold-compare equal to `STRASSE` which must casefold-compare equal to `strasse` (which is a legitimate spelling in Swiss German). – n. m. could be an AI Apr 30 '23 at 05:27
  • (contd) "Raku's fc fold-case implementation essentially converts to lowercase ss" This mapping is mandated by Unicode (the very same data file you link to in your comment above) and there are good reasons for that. Why would raku (or any other language indeed) do something non-standard? What kind of problem would it solve? – n. m. could be an AI Apr 30 '23 at 05:58
  • Defined in *Unicode technical standard* [UTS 10 Unicode Collation Algorithm](https://www.unicode.org/reports/tr10/#Special_Cases). – JosefZ Apr 30 '23 at 14:48
  • @n.m. Yes, I researched the answer after I wrote the question. Now I have to interpret the Unicode Consortium's rules. As of 2017 you have a German governmental body mandating something that the Unicode Consortium disallows. So there IS a question here--"how to handle this programmatically?"--but if no one else answers I guess I'll have to answer the question myself. – jubilatious1 Apr 30 '23 at 14:53
  • "you have a German governmental body mandating something that the Unicode Consortium disallows" This is simply not true. There is no contradiction anywhere. The German Orthography Council rules on *orthography*, while Unicode concerns itself with *string matching*. The two are fully orthogonal. The German Orthography Council tells you which spellings are correct German. The Unicode consortium tells you which strings are considered equal when case is ignored. Strings do not have to be correctly spelled words to be compared. – n. m. could be an AI Apr 30 '23 at 15:30
  • *So there IS a question here--"how to handle this programmatically?"* So why not ask *that* question instead? The answer would depend on what you mean by "it". If you want a program that can casefold-compare strings, then follow Unicode. If you want a program that correctly flips German words between all-caps and normal case convention according to German spelling rules, then you better start training your large language model now, because you can't do that without understanding what German words mean in context. – n. m. could be an AI Apr 30 '23 at 16:00
  • From the linked FAQ: "Since the [Unicode] case folding rules do not vary by language or context, this makes them unsuitable as the basis for displaying or transforming text for human consumption." Unicode case folding is not intended to match all language casing rules. And in fact it is provably *impossible* for it to do so. In some languages, capital "I" is not the uppercase version of lowercase "i". – Raymond Chen Apr 30 '23 at 18:50

1 Answer

1. Lowercase/Uppercase:

In Raku, the default conversion from lowercase German ß is to uppercase SS, but this can be overcome (as shown below).

The Unicode Consortium has a special FAQ on these letters in the German language. However, to work around the uc uppercasing issue in Raku, ß can be translated to ẞ before calling the bog-standard uc uppercase method/function:

~$ cat TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
Maß Straße Grieß Spieß Groß Grüßen Außen Außer Draußen Strauß Beißen Fleiß Heißen
~$ raku -ne '.uc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
MASS STRASSE GRIESS SPIESS GROSS GRÜSSEN AUSSEN AUSSER DRAUSSEN STRAUSS BEISSEN FLEISS HEISSEN
~$ raku -ne '.trans("ß" => "ẞ").put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
Maẞ Straẞe Grieẞ Spieẞ Groẞ Grüẞen Auẞen Auẞer Drauẞen Strauẞ Beiẞen Fleiẞ Heiẞen
~$ raku -ne '.trans("ß" => "ẞ").uc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
MAẞ STRAẞE GRIEẞ SPIEẞ GROẞ GRÜẞEN AUẞEN AUẞER DRAUẞEN STRAUẞ BEIẞEN FLEIẞ HEIẞEN

The code above works to uppercase text with ẞ instead of SS--and in true Raku/Perl spirit--there's more than one way to do it (TMTOWTDI):

~$ raku -ne '.trans("ß" => "ẞ").uc.put;' file
~$ raku -e '.trans("ß" => "ẞ").uc.put for lines();' file
~$ raku -e 'put .trans("ß" => "ẞ").uc for lines();' file
~$ raku -e 'slurp.trans("ß" => "ẞ").uc.put;' file
~$ raku -e 'slurp.trans( "\x[00DF]" => "\x[1E9E]" ).uc.put;' file
~$ raku -e 'slurp.trans("LATIN SMALL LETTER SHARP S".uniparse => "LATIN CAPITAL LETTER SHARP S".uniparse).uc.put;' file
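
For readers working outside Raku, the same translate-then-uppercase idea can be sketched in Python. This is an illustrative assumption, not part of the original answer: `SHARP_S_TO_CAPITAL` and `upper_with_capital_sharp_s` are hypothetical names, and the sketch relies only on the built-in `str.maketrans`, `str.translate`, and `str.upper`:

```python
# Map LATIN SMALL LETTER SHARP S (U+00DF) to
# LATIN CAPITAL LETTER SHARP S (U+1E9E) before uppercasing.
SHARP_S_TO_CAPITAL = str.maketrans({"\u00DF": "\u1E9E"})


def upper_with_capital_sharp_s(text: str) -> str:
    """Uppercase text, using the capital sharp s instead of SS for U+00DF."""
    return text.translate(SHARP_S_TO_CAPITAL).upper()


sample = "Maß Straße Grieß"
print(sample.upper())                      # default uppercasing
print(upper_with_capital_sharp_s(sample))  # translate first, then uppercase
```

The translation has to happen before the call to `upper`, because the default uppercasing expands ß to SS and that expansion cannot be told apart afterwards from an ss that was already in the word.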

2. Foldcase:

The Unicode Consortium promulgates a rule that foldcase pairs should be stable (according to the Unicode Casefolding Stability Policy).

As for fc foldcase stability, I had hoped that prior conversion of "ß" => "ẞ" would provide a "30th-uppercase character" that would act as a bicameral foldcase partner of lowercase ß (in a pair). The code below seems promising in that starting with a small sample of mixed-case text, you can "round-trip" from uppercase-to-lowercase, and still have output text matching lowercase:

~$ raku -ne 'for .words {print $_.uc.lc eq $_.lc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.lc eq $_.lc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue

However, the fc foldcase code below shows that the present course of action is to take an uppercase ẞ and convert to lowercase ss (not to lowercase ß). Essentially .fc foldcase converts uppercase ẞ or SS to lowercase ss, regardless:

~$ raku -ne '.trans("ß" => "ẞ").fc.put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
mass strasse griess spiess gross grüssen aussen ausser draussen strauss beissen fleiss heissen
~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.fc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
TrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrueTrue
~$ raku -ne 'for .words {print $_.trans("ß" => "ẞ").uc.lc eq $_.fc }; "".put;' TO_ẞ_OR_NOT_TO_ẞ_tclc.txt
FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
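
The same split between lowercasing and case folding can be observed in Python, shown here as a hedged comparison sketch rather than a statement about Raku's internals. The names `SPELLINGS` and `equal_ignoring_case` are hypothetical; only the built-in `str.lower` and `str.casefold` are used:

```python
# Three spellings of the same word: lowercase, all-caps with SS,
# and all-caps with the capital sharp s (U+1E9E).
SPELLINGS = ["straße", "STRASSE", "STRA\u1E9EE"]


def equal_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison based on Unicode case folding."""
    return a.casefold() == b.casefold()


lowered = [s.lower() for s in SPELLINGS]    # lowercasing keeps ß where present
folded = [s.casefold() for s in SPELLINGS]  # folding maps ß and ẞ to ss

print(lowered)
print(folded)
print(all(equal_ignoring_case(SPELLINGS[0], s) for s in SPELLINGS))
```

Lowercasing yields two distinct strings (straße and strasse), whereas folding collapses all three spellings to strasse, which is why a fold-based comparison treats them as equal.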

Changes anticipated? According to a 2017 StackOverflow post, "Just wait half a century."

https://raku.org

jubilatious1