
I have a string

$string = 'AbCdEf';

and I want to use the tr operator to convert all the uppercase letters to lowercase and all the lowercase letters to uppercase at the same time. I basically just want to swap the case of every letter so that it becomes:

aBcDeF

I came up with this line, but I'm not sure how to modify it to do what I want. Any help please?

$string =~ tr/A-Z/a-z/;

Thanks!

tchrist
Brian
  • I assume that you would want to leave untouched those uppercase letters that have no lowercase correspondents, and vice versa — right? That’s what I did in my solution. An example of that would be that abbreviation people use for *number*: "Nº". The abbreviation’s first letter is uppercase and has a lowercase form in "n", but its second letter is a lowercase letter that has no corresponding uppercase version to go along with it. So that would just be "nº" if you swapped cases, because there is no way to swap case on things like 0xBA (which is what º is), even though it is considered lowercase. – tchrist Apr 09 '11 at 22:51

3 Answers


At Tom's request, the Unicode-clean (or locales-clean) version:

s/([[:upper:]])|([[:lower:]])/defined $1 ? lc $1 : uc $2/eg
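A minimal sketch of how that substitution might be used, assuming Perl 5.8 or later. The sub name `swap_case_posix` and the Cyrillic sample word are illustrative choices, not part of the answer. For code points above U+00FF the POSIX classes match by Unicode rules, so the same line handles non-ASCII letters:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution shown above.
sub swap_case_posix {
    my ($string) = @_;
    # $1 is set for an uppercase character, $2 for a lowercase one.
    $string =~ s/([[:upper:]])|([[:lower:]])/defined $1 ? lc $1 : uc $2/eg;
    return $string;
}

binmode(STDOUT, ':encoding(UTF-8)');

# Cyrillic sample word ("Privet"), written with escapes so the source stays ASCII.
my $word = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}";

print swap_case_posix('AbCdEf'), "\n";   # prints aBcDeF
print swap_case_posix($word), "\n";      # first letter lowercased, the rest uppercased
```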
hobbs
  • @tchrist true, but it's also more likely that someone will be able to read it without spending an hour consulting a unicode properties reference, and AFAICT it's only a performance issue, not a correctness one :) – hobbs Apr 09 '11 at 23:59
  • Well... it might be a correctness issue. I dunno. The problem isn’t well specified. But my version does stuff with titlecased things that change. If you don’t like `\p{CWU}` or `\p{CWL}`, you are more than welcome to use `\p{Changes_When_Uppercased}` and `\p{Changes_When_Lowercased}`. If you really need to look up to see what that means, I suspect that an English dictionary might be of more use here than would the Unicode Standard. :) – tchrist Apr 10 '11 at 01:29
  • Also, I wouldn’t ever use the legacy POSIX thingies like `[[:upper:]]`. Locales are a nastiness. You really want to use Unicode instead. I get really nervous looking at all the stuff in the *perlrecharclass* manpage that involves POSIX locales. I don’t like parsing through `\p{PosixAlpha}` vs ASCII alpha vs `\p{XPosixAlpha}`. If you are using locales, then you have 8-bit legacy data that you have for some reason forgotten to decode properly. What am I not thinking of? – tchrist Apr 10 '11 at 01:32

$string =~ tr/A-Za-z/a-zA-Z/;
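A minimal sketch applying that `tr` to the string from the question. The wrapper name `swap_case_ascii` is only illustrative, and by construction it touches nothing outside ASCII A-Z and a-z:

```perl
use strict;
use warnings;

# Hypothetical helper: swap case for ASCII letters only.
sub swap_case_ascii {
    my ($string) = @_;
    # Map A-Z onto a-z and a-z onto A-Z in a single pass.
    $string =~ tr/A-Za-z/a-zA-Z/;
    return $string;
}

print swap_case_ascii('AbCdEf'), "\n";   # prints aBcDeF
```

Because both ranges sit in one transliteration, each character is mapped exactly once, which is why this differs from running `tr/A-Z/a-z/` and then `tr/a-z/A-Z/` in sequence.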

friedo
  • Yes, and what’s the modern solution, the one that isn’t stuck in 7-bit ᴀsᴄɪɪ? :) – tchrist Apr 09 '11 at 21:33
  • There's nothing wrong with simple 7-bit ASCII if that's what you're working with. – friedo Apr 09 '11 at 21:37
  • When someone says "all the uppercase", they are asking for `\p{Upper}`, not for `[A-Z]`. Similarly with "all the lowercase", where they are asking for `\p{Lower}` not `[a-z]`. Both a-z and A-Z have a *code smell*: “They’re always wrong — sometimes.” I hate being guaranteed to be sometimes wrong when a little bit more care can guarantee that I am never wrong. It’s like how there’s a world of difference between having a very very small race condition and having no race condition at all. The careful programmer knows only one of those two situations is right, so **always** avoids the other one. – tchrist Apr 09 '11 at 22:39

You can do the full Unicode solution either this way:

    s/ (\p{CWU}) | (\p{CWL}) /defined $1 ? uc $1 : lc $2/gex;

or this way:

    s/ (\p{CWL}) | (\p{CWU}) /defined $1 ? lc $1 : uc $2/gex;

Which one you choose depends on what you want to do with something that changes case in both directions, like ǲ (U+01F2), whose uppercase is Ǳ (U+01F1) and whose lowercase is ǳ (U+01F3).
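A minimal sketch of that difference, assuming a Perl recent enough to know `\p{CWU}` and `\p{CWL}` (5.12 or later). The sub names `swap_upper_first` and `swap_lower_first` are hypothetical wrappers around the two substitutions above:

```perl
use strict;
use warnings;

# Hypothetical wrapper: tries "changes when uppercased" first.
sub swap_upper_first {
    my ($s) = @_;
    $s =~ s/ (\p{CWU}) | (\p{CWL}) /defined $1 ? uc $1 : lc $2/gex;
    return $s;
}

# Hypothetical wrapper: tries "changes when lowercased" first.
sub swap_lower_first {
    my ($s) = @_;
    $s =~ s/ (\p{CWL}) | (\p{CWU}) /defined $1 ? lc $1 : uc $2/gex;
    return $s;
}

# U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z matches both properties,
# so whichever alternative comes first decides the result.
my $titlecase_dz = "\x{1F2}";
printf "U+%04X\n", ord(swap_upper_first($titlecase_dz));   # prints U+01F1
printf "U+%04X\n", ord(swap_lower_first($titlecase_dz));   # prints U+01F3
```

For characters that change in only one direction, such as plain ASCII letters, the two wrappers give the same answer.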

If you run the second of those two substitutions across this input:

     @   0040  COMMERCIAL AT
     ©   00A9  COPYRIGHT SIGN
     Å   212B  ANGSTROM SIGN
     ⒜   249C  PARENTHESIZED LATIN SMALL LETTER A
     Ⓐ   24B6  CIRCLED LATIN CAPITAL LETTER A
     ⓐ   24D0  CIRCLED LATIN SMALL LETTER A
     Ａ  FF21  FULLWIDTH LATIN CAPITAL LETTER A
     ａ  FF41  FULLWIDTH LATIN SMALL LETTER A
     Ⓒ   24B8  CIRCLED LATIN CAPITAL LETTER C
     ⓒ   24D2  CIRCLED LATIN SMALL LETTER C
     Ǳ   01F1  LATIN CAPITAL LETTER DZ
     ǲ   01F2  LATIN CAPITAL LETTER D WITH SMALL LETTER Z
     ǳ   01F3  LATIN SMALL LETTER DZ
     ⅲ   2172  SMALL ROMAN NUMERAL THREE
     S   0053  LATIN CAPITAL LETTER S
     s   0073  LATIN SMALL LETTER S
     ſ   017F  LATIN SMALL LETTER LONG S
     ⒮   24AE  PARENTHESIZED LATIN SMALL LETTER S
     Ⓢ   24C8  CIRCLED LATIN CAPITAL LETTER S
     ⓢ   24E2  CIRCLED LATIN SMALL LETTER S
     Ꞅ   A784  LATIN CAPITAL LETTER INSULAR S
     ꞅ   A785  LATIN SMALL LETTER INSULAR S
     ß   00DF  LATIN SMALL LETTER SHARP S
     ẞ   1E9E  LATIN CAPITAL LETTER SHARP S
     Ⅶ   2166  ROMAN NUMERAL SEVEN
     ⅻ   217B  SMALL ROMAN NUMERAL TWELVE

it produces these results:

     @   0040  commercial at
     ©   00a9  copyright sign
     å   212b  angstrom sign
     ⒜   249c  parenthesized latin small letter a
     ⓐ   24b6  circled latin capital letter a
     Ⓐ   24d0  circled latin small letter a
     ａ  ff21  fullwidth latin capital letter a
     Ａ  ff41  fullwidth latin small letter a
     ⓒ   24b8  circled latin capital letter c
     Ⓒ   24d2  circled latin small letter c
     ǳ   01f1  latin capital letter dz
     ǳ   01f2  latin capital letter d with small letter z
     Ǳ   01f3  latin small letter dz
     Ⅲ   2172  small roman numeral three
     s   0053  latin capital letter s
     S   0073  latin small letter s
     S   017f  latin small letter long s
     ⒮   24ae  parenthesized latin small letter s
     ⓢ   24c8  circled latin capital letter s
     Ⓢ   24e2  circled latin small letter s
     ꞅ   a784  latin capital letter insular s
     Ꞅ   a785  latin small letter insular s
     SS   00df  latin small letter sharp s
     ß   1e9e  latin capital letter sharp s
     ⅶ   2166  roman numeral seven
     Ⅻ   217b  small roman numeral twelve

With the first substitution instead, the only difference (in that set) would be in the three DZ digraph lines, which would then look like this:

     ǳ   01f1  latin capital letter dz
     Ǳ   01f2  latin capital letter d with small letter z
     Ǳ   01f3  latin small letter dz

The reason you don’t want to use just an uppercase or lowercase test is that you would then do unnecessary work, since plenty of cased code points do not change when casemapped. All of these, for example, are cased code points that change neither when uppercased nor when lowercased:

     ª   00AA FEMININE ORDINAL INDICATOR
     ᴬ   1D2C MODIFIER LETTER CAPITAL A
     ᴀ   1D00 LATIN LETTER SMALL CAPITAL A
     ℂ   2102 DOUBLE-STRUCK CAPITAL C
     ᴰ   1D30 MODIFIER LETTER CAPITAL D 
     ʣ   02A3 LATIN SMALL LETTER DZ DIGRAPH
     ʤ   02A4 LATIN SMALL LETTER DEZH DIGRAPH
     ℇ   2107 EULER CONSTANT
     ɘ   0258 LATIN SMALL LETTER REVERSED E
     ɞ   025E LATIN SMALL LETTER CLOSED REVERSED OPEN E
     ℊ   210A SCRIPT SMALL G
     ɡ   0261 LATIN SMALL LETTER SCRIPT G
     ɢ   0262 LATIN LETTER SMALL CAPITAL G
     ʰ   02B0 MODIFIER LETTER SMALL H
     ℋ   210B SCRIPT CAPITAL H
     ℎ   210E PLANCK CONSTANT 
     ℹ   2139 INFORMATION SOURCE
     ʲ   02B2 MODIFIER LETTER SMALL J
     ℳ   2133 SCRIPT CAPITAL M
     º   00BA MASCULINE ORDINAL INDICATOR
     ɸ   0278 LATIN SMALL LETTER PHI
     ĸ   0138 LATIN SMALL LETTER KRA
     ʏ   028F LATIN LETTER SMALL CAPITAL Y
     ℼ   213C DOUBLE-STRUCK SMALL PI

So you would detect that they were upper- or lowercase, then call the inverse mapping function, then discover that nothing at all had changed. I figure, why bother?
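A small check of that point for two code points from the list, U+00BA and U+0138. This is a sketch assuming Perl 5.12 or later, and the sub name `case_facts` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (is_lowercase, changes_when_uppercased, uc_is_noop).
sub case_facts {
    my ($char) = @_;
    my $is_lower = $char =~ /\p{Lower}/ ? 1 : 0;   # a plain "lower" test matches
    my $changes  = $char =~ /\p{CWU}/   ? 1 : 0;   # the CWU test does not
    my $noop     = uc($char) eq $char   ? 1 : 0;   # uc() returns it unchanged
    return ($is_lower, $changes, $noop);
}

# U+00BA MASCULINE ORDINAL INDICATOR, U+0138 LATIN SMALL LETTER KRA
for my $char ("\x{BA}", "\x{138}") {
    printf "U+%04X lower=%d cwu=%d uc_is_noop=%d\n", ord($char), case_facts($char);
}
```

Both lines should report `lower=1 cwu=0 uc_is_noop=1`: a lowercase test would select these characters and call `uc` on them for nothing, while `\p{CWU}` skips them.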

tchrist
  • Whose fault is it that `ß` isn't uppercased to `ẞ`? – daxim Apr 10 '11 at 19:41
  • @daxim: Unicode defines the uppercase mapping of U+DF as U+53 U+53; that is, for **ß** to uppercase to **SS**. This is found in the file `SpecialCasing.txt` within the `unicore/` directory. U+00DF ‹ß› `\N{LATIN SMALL LETTER SHARP S}` has `\p{Age:1.1}`, whereas U+1E9E ‹ẞ› `\N{LATIN CAPITAL LETTER SHARP S}` has `\p{Age:5.1}`. Round‐tripping on casemapping transforms has never been guaranteed, you know. Consider how U+3C3 **σ** and U+3C2 **ς** both become U+3A3 **Σ** when uppercased, yet that same U+3A3 **Σ** becomes only U+3C3 **σ** when lowercased. There are countless similar examples of this. – tchrist Apr 10 '11 at 20:22