3

I have the below test case and only the first assertion passes. Why?

@Test
public void test() {
    String i1 = "i";
    String i2 = "İ".toLowerCase();

    System.out.println((int)i1.charAt(0)); // 105
    System.out.println((int)i2.charAt(0)); // 105

    assertTrue(i2.startsWith(i1));

    assertTrue(i2.endsWith(i1));
    assertTrue(i1.endsWith(i2));
    assertTrue(i1.startsWith(i2));
}

Update after replies

What I am trying to do is use startsWith and endsWith in a case-insensitive way, so that the expression below returns true.

"ALİ".toLowerCase().endsWith("i");

I guess it is different for C# and Java.

Mehmet Ataş
  • 1
    Can you please change your question so that you're not doing `toLowerCase()`? What is the character `toLowerCase()` outputs? – 4castle Aug 04 '17 at 20:27
  • Please see my update with `println`s – Mehmet Ataş Aug 04 '17 at 20:33
  • Now try to print `i1.length()` and `i2.length()`. – Pshemo Aug 04 '17 at 20:34
  • OK lengths are different :) I updated my question again. – Mehmet Ataş Aug 04 '17 at 20:39
  • 1
    [`toLowerCase`](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#toLowerCase--) does not take a string as an argument, and it doesn't return a boolean, so it can't evaluate to true. – nbrooks Aug 04 '17 at 20:45
  • See also Java Bug [JDK-8020037 String.toLowerCase incorrectly increases length, if string contains \u0130 char](https://bugs.openjdk.java.net/browse/JDK-8020037) – Andreas Aug 04 '17 at 21:01

4 Answers

6

This happens because lowercasing İ ("latin capital letter i with dot above") in an English locale produces two characters: "latin small letter i" and "combining dot above".

This explains why the result starts with i but doesn't end with i (it ends with a combining diacritical mark instead).

In a Turkish locale, lowercasing İ simply yields "latin small letter i", in accordance with Turkish casing rules, and your code would therefore work.

Here's a test program to help figure out what's going on:

class Test {
  public static void main(String[] args) {
    char[] foo = args[0].toLowerCase().toCharArray();
    System.out.print("Lowercase " + args[0] + " has " + foo.length + " chars: ");
    for(int i=0; i<foo.length; i++) System.out.print("0x" + Integer.toString((int)foo[i], 16) + " ");
    System.out.println();
  }
}

Here's what we get when we run it on a system configured for English:

$ LC_ALL=en_US.utf8 java Test "İ"
Lowercase İ has 2 chars: 0x69 0x307

Here's what we get when we run it on a system configured for Turkish:

$ LC_ALL=tr_TR.utf8 java Test "İ"
Lowercase İ has 1 chars: 0x69

This is even the specific example used by the API docs for String.toLowerCase(Locale), which is the method you can use to get the lowercase version in a specific locale, rather than the system default locale.
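As a minimal sketch of that method, the class below lowercases the question's "ALİ" with two explicit locales. The class name `LocaleLowercase` and the helper `lower` are made up for illustration, and `\u0130` is just İ written as an escape so the source-file encoding does not matter.

```java
import java.util.Locale;

public class LocaleLowercase {
    // Hypothetical helper: lowercases with an explicit locale
    // instead of relying on the JVM's default locale.
    public static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }

    public static void main(String[] args) {
        String name = "AL\u0130"; // "ALİ"
        Locale turkish = Locale.forLanguageTag("tr-TR");

        String rootResult = lower(name, Locale.ROOT); // locale-neutral rules
        String turkishResult = lower(name, turkish);  // Turkish rules

        System.out.println(rootResult.length());        // 4: a, l, i, combining dot above
        System.out.println(turkishResult.length());     // 3: a, l, i
        System.out.println(rootResult.endsWith("i"));    // false
        System.out.println(turkishResult.endsWith("i")); // true
    }
}
```

Passing the locale explicitly makes the result independent of how the machine running the code is configured.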

that other guy
3

İ is the Unicode character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130), and is a Java String with a length of 1.

"İ".toLowerCase() returns a Java String with a length of 2:

  • 'LATIN SMALL LETTER I' (U+0069)
  • 'COMBINING DOT ABOVE' (U+0307)

That is because there is no precomposed character 'LATIN SMALL LETTER I WITH DOT ABOVE'. It does not exist in Unicode.
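One way to check this claim, sketched here with `java.text.Normalizer` (which the answers above do not use), is to ask for NFC normalization. NFC merges a base letter and a combining mark whenever a precomposed character exists. The class name `NoPrecomposedForm` and the helper `hex` are made up for illustration.

```java
import java.text.Normalizer;

public class NoPrecomposedForm {
    // Hypothetical helper: the UTF-16 code units of the text as hex, space-separated.
    public static String hex(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (out.length() > 0) out.append(' ');
            out.append(Integer.toHexString(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // i + combining dot above: what "İ".toLowerCase() yields outside Turkish locales
        String lowerI = "i\u0307";
        // e + combining dot above, for contrast: a precomposed form (U+0117) exists
        String lowerE = "e\u0307";

        System.out.println(hex(Normalizer.normalize(lowerI, Normalizer.Form.NFC))); // 69 307
        System.out.println(hex(Normalizer.normalize(lowerE, Normalizer.Form.NFC))); // 117
    }
}
```

The i stays as two code units after normalization, while the e collapses into one.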

Andreas
3

After executing the toLowerCase() function, the string length is 2 instead of 1; the lower case version of that character is represented by two characters:

000> "İ".length()
===> 1
000> "İ".toLowerCase().length()
===> 2

The first character in its lowercase representation is a lowercase latin i, while the second character is a diacritic:

000> "İ".toLowerCase().charAt(0)
===> i
000> "İ".toLowerCase().charAt(1)
===> ̇

So the lowercase string does "start with" i, but it doesn't end with it.
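If the goal is a case-insensitive suffix or prefix check, one possible approach (not taken from the answers here) is to skip toLowerCase() entirely and use String.regionMatches with ignoreCase set to true. It compares char by char, so the string length never changes. The class and helper names below are hypothetical, and `\u0130` is İ written as an escape.

```java
public class SuffixIgnoreCase {
    // Hypothetical helper: true if text ends with suffix, ignoring case.
    public static boolean endsWithIgnoreCase(String text, String suffix) {
        int offset = text.length() - suffix.length();
        // regionMatches returns false when the offset is negative,
        // so a suffix longer than the text is handled without an extra check.
        return text.regionMatches(true, offset, suffix, 0, suffix.length());
    }

    // Hypothetical helper: true if text starts with prefix, ignoring case.
    public static boolean startsWithIgnoreCase(String text, String prefix) {
        return text.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        System.out.println(endsWithIgnoreCase("AL\u0130", "i"));   // "ALİ" vs "i"
        System.out.println(startsWithIgnoreCase("\u0130zmir", "i"));
        System.out.println(endsWithIgnoreCase("i", "AL\u0130"));   // suffix longer than text
    }
}
```

This comparison is not locale-sensitive: it also treats a plain I as matching i, which may or may not be what Turkish text needs.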

nbrooks
1

Your test is failing because you are using the methods incorrectly...

"İ" is the Turkish capital form of i. If you don't pass a locale to toLowerCase(), the conversion uses the JVM's default locale, and in a non-Turkish locale the result is two characters rather than a plain i.

Using a locale may help :)

import java.util.Locale;

public class Main {
    public static void main(String[] args) {

        String i1 = "i";
        // Lowercase with Turkish rules so that İ becomes a plain i
        String i2 = "İ".toLowerCase(Locale.forLanguageTag("tr-TR"));

        System.out.println((int)i1.charAt(0)); // 105
        System.out.println((int)i2.charAt(0)); // 105

        System.out.println(i2.startsWith(i1));
        System.out.println(i2.endsWith(i1));
        System.out.println(i1.endsWith(i2));
        System.out.println(i1.startsWith(i2));
    }
}

The output will be:

105
105
true
true
true
true

ΦXocę 웃 Пepeúpa ツ