21

I encountered a question in interview to write a method to check for similar words irrespective of character cases.

I answered it by using the difference of ASCII value for each pair of characters. But at home, when I went through the actual implementation of it in String.class, I get disturbed - Why is it implemented that way!

I tried to draw a comparison between inbuilt and my custom method, this way-

public class EqualsIgnoreCase {

    public static void main(String[] args) {
        String str1 = "Srimant @$ Sahu 959s";
        String str2 = "sriMaNt @$ sAhu 959s";

        System.out.println("Avg millisecs with inbuilt () - " + averageOfTenForInbuilt(str1, str2));
        System.out.println("\nAvg millisecs with custom () - " + averageOfTenForCustom(str1, str2));
    }

    public static int averageOfTenForInbuilt(String str1, String str2) {
        int avg = 0;
        for (int itr = 0; itr < 10; itr++) {
            long start1 = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
                str1.equalsIgnoreCase(str2);
            }
            avg += System.currentTimeMillis() - start1;
        }
        return avg / 10;
    }

    public static int averageOfTenForCustom(String str1, String str2) {
        int avg = 0;
        for (int itr = 0; itr < 10; itr++) {
            long start2 = System.currentTimeMillis();
            for (int i = 0; i < 100000; i++) {
                isEqualsIgnoreCase(str1, str2);
            }
            avg += System.currentTimeMillis() - start2;
        }
        return avg / 10;
    }

    public static boolean isEqualsIgnoreCase(String str1, String str2) {
        int length = str1.length();
        if (str2.length() != length) {
            return false;
        }

        for (int i = 0; i < length; i++) {
            char ch1 = str1.charAt(i);
            char ch2 = str2.charAt(i);

            int val = Math.abs(ch1 - ch2);
            if (val != 0) {
                if (isInAlphabetsRange(ch1, ch2)) {
                    if (val != 32) {
                        return false;
                    }
                } else {
                    return false;
                }
            }
        }
        return true;
    }

    public static boolean isInAlphabetsRange(char ch1, char ch2) {
        return (((ch1 <= 122 && ch1 >= 97) || (ch1 <= 90 && ch1 >= 65)) && ((ch2 <= 122 && ch2 >= 97) || (ch2 <= 90 && ch2 >= 65)));
    }

}

Output-

Avg millisecs with inbuilt () - 14

Avg millisecs with custom () - 5

I found that the inbuilt method is hitting efficiency, as because of lots of checks and method calls. Is there any specific reasons behind such an implementation? Or Am I missing something in my logic?

Any suggestions, will be heartily appreciated!

Raedwald
  • 46,613
  • 43
  • 151
  • 237
53by97
  • 425
  • 3
  • 17
  • 4
    Try to call `averageOfTenForCustom` first, and then `averageOfTenForInbuilt`: there might be a difference between your actual results because of the JVM startup. – sp00m May 27 '14 at 08:53
  • 42
    Conversion between lowercase and uppercase is not a simple task. You are working only with basic ASCII latin characters (character distance 32). However when you are working with the whole Unicode set, the problem is pretty complex. http://www.unicode.org/faq/casemap_charprop.html – Pavel Horal May 27 '14 at 08:54
  • 1
    @sp00m There is no significant difference while calling them alternately! I have already checked before. – 53by97 May 27 '14 at 09:01
  • 7
    Also considering differences of 0 and 32 as "matches" would mean that * (42) and J (74) are equal :-) So you'd return a match for "Jump" and "*ump" – Brian May 27 '14 at 09:04
  • @Brian Didnt test on special characters / extremities! Thanks for suggesting it! – 53by97 May 27 '14 at 09:20
  • @Brian it's easy to check that both characters are in the right range. What makes it a pain, it that some genius decided it was a good idea to put some special characters BETWEEN the lower case and uppercase letters. – Cruncher May 27 '14 at 20:08
  • 1
    even in ASCII your code runs wrong for characters that is not in the Laitn alphabet. For example it'll return true for "0" and "P" or "0" and "\x10' – phuclv May 28 '14 at 07:08

5 Answers5

64

Your routine only handles ASCII characters. The system one handles all unicode characters.

Consider following example:

public class Test {

    public static void main(String[] args) {
        System.out.println((int) 'ě'); // => 283
        System.out.println((int) 'Ě'); // => 282 
    }

}
Pavel Horal
  • 17,782
  • 3
  • 65
  • 89
Paul LeBeau
  • 97,474
  • 9
  • 154
  • 181
  • But in almost all cases we are not encountering with Unicode at regular basis, which inturn causing a performance hit to any simple usage of the same. – 53by97 May 27 '14 at 09:05
  • 26
    @53by97 Be careful about premature and naive optimization. I am pretty sure that whatever your code does, trying to optimize `equalsIgnoreCase` will not add any significant improvement. – Pavel Horal May 27 '14 at 09:10
  • @PavelHoral But it would be better if there would have 2 separate methods, both differentiated based on functionality. – 53by97 May 27 '14 at 09:14
  • 4
    All String are written so as to handle Unicode characters. ASCII is simply to small a set, that is what makes it easy to check only ASCII characters. If you look at the inner implementations of toUpperCase / toLowerCase etc, there is much more complexity involved then we could ever imagine.. :) – TheLostMind May 27 '14 at 09:16
  • 1
    By the way - Java is handling ASCII characters separately. If you dig deep enough, there are optimizations for ASCII already in place ;). http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/CharacterDataLatin1.java#CharacterDataLatin1.0instance – Pavel Horal May 27 '14 at 09:17
  • 3
    Could the system one be faster? Probably. There are trade-offs between code speed, developer time, and code readability. If maximum speed is important to your use case, then by all means use your routine (fix it first! :). However you should consider carefully whether it is needed. For instance what happens if, in the future, you need to handle Spanish names and everything breaks because of that optimisation you put in years ago? – Paul LeBeau May 27 '14 at 09:20
  • 40
    @53by97: "But in almost all cases we are not encountering with Unicode at regular basis" - how delightfully insular of you. This might be true in parts of the US; but it is entirely untrue in the world at large. Your toy implementation may be fit for purpose in the particular narrow field you need it but library designers need to build their code more robustly. – Jack Aidley May 27 '14 at 13:12
  • 33
    @53by97 As a professor of mine used to say "If it doesn't have to be correct, I can make it arbitrarily fast". And really anybody naïve enough to think that English only contains ASCII characters is in for a surprise ;) – Voo May 27 '14 at 14:11
  • 2
    @53by97 your first statement is wrong. Most languages that uses Latin or Cyrillic characters have special or accented characters outside the ASCII range, and you need to handle their cases correctly – phuclv May 27 '14 at 14:51
  • 7
    even English have many characters with diacritics such as the word naïve above, or résumé, exposé... – phuclv May 27 '14 at 14:55
  • 4
    @JackAidley: 53by97 is in fact Indian. ASCII-chauvinism is widespread in our country. When we want to write in our native languages we convert them into ASCII instead of using the proper scripts. Unicode? Why would anyone ever need Unicode? /s – Anubhav C May 27 '14 at 15:15
  • 3
    @AnubhavChattoraj: I apologise for assuming a US identity. None-the-less the wider point stands. – Jack Aidley May 27 '14 at 16:05
  • 12
    There's also a false shortcut with `if (str2.length() != length)`. `"weißbier".length() != "WEISSBIER".length()`. – Jon Hanna May 27 '14 at 16:55
  • 2
    @LưuVĩnhPhúc: There could still be significant value in using a multi-part test: first test reference equality, then ordinal equality, then a fast equality/inequality using ASCII-based case-folding (two ASCII characters which don't match should early-exit `false`), and then finally do a more complicated test. While there are many combinations of strings whose equality can only be determined by a slow inspection process, in many usage cases such tests will often be avoidable. – supercat May 27 '14 at 17:50
  • 2
    Also: `\u0069` ('i') -> `\u0130` (not 'I') and `\u0131` -> `\u0049` ('I') in Turkish – Brian S May 27 '14 at 22:34
  • Even if Java had provided a special version of the function, most people won't use it because it can't run correctly for their languages – phuclv May 28 '14 at 07:14
  • 1
    If you look at the MS's library, you'll see that they also have to work for all characters in Unicode. They can even parse numbers in [other scripts](http://blogs.msdn.com/b/oldnewthing/archive/2004/03/09/86555.aspx), not only Arabic numbers – phuclv May 28 '14 at 07:33
56

Your method is incorrect in many ways. For instance, it considers "!" equal to "B", "B" equal to "1", but "!" not equal to "1" (so it isn't transitive as we would expect an equals method to be).

Yes, it is quite easy to write an incorrect implementation for that method that is both faster and simpler. A fair challenge would be to write a correct one, i.e. that correctly handles all arguments the JDK implementation does.

You might also wish to look at How do I write a correct micro-benchmark in Java? to get more reliable performance measurements.

Community
  • 1
  • 1
meriton
  • 68,356
  • 14
  • 108
  • 175
  • 26
    +1 - Reminds me of a discussion I once read: "a: ah but your method is twice as slow as mine". "b: If I'm allowed to return incorrect results, I can make my method run in constant time" – Lieven Keersmaekers May 27 '14 at 11:00
  • @LievenKeersmaekers plz check the edited code, which accounts for alphabetic range! – 53by97 May 28 '14 at 09:13
  • Ok, now you are handling ASCII. What of characters from other charsets, such as ISO-8869-1, let alone Unicode? – meriton May 28 '14 at 09:42
11

This might not be the only reason, but the fact that your solution doesn't actually work for all possible strings is definitely a factor.

There are some (annoying) locales for which two characters might have the same upper case but not the same lower case. For this reason, in order to work (most of the time, see Turkish), the canonical implementation must compare the strings char-for-char in both their lower and upper cases.

Your implementation is probably perfect 99% of the time, especially if you only have to deal with the english locale, but the core library implementation can unfortunately not make such assumptions.

Nicolas Rinaudo
  • 6,068
  • 28
  • 41
  • The "two lower cases, one upper case" problem isn't present in Turkish, but it's in Greek (sigma). Turkish is fairly easy in comparison, it just doesn't have I as capital i. – MSalters May 27 '14 at 12:21
  • 2
    Sorry, I meant that `equalsToIgnoreCase` failed to work for some specific locales (such as Turkish because of the `i` problem), not that it suffered from the lower case / upper case issue - although I can see how this was not clear in the way I phrased my sentence. Thanks for giving an actual example of the issue though: I knew about the issue intellectually but had never actually *seen* it before. – Nicolas Rinaudo May 27 '14 at 12:27
4

I think that the checking of

String1.equalsIgnoreCase(String2)

the one which is provided has a much better character acceptance and it accepts all kind of Character Values included in the Unicode;but; what you have tried to figure out through your custom code is that you are comparing only the English Alphabetical characters.

So, I think,on the lines of Pavel Horel,the commentator on your post,that due to the sophistication it provides for comparison between all kinds of Unicode characers,it might be taking more time.

Am_I_Helpful
  • 18,735
  • 7
  • 49
  • 73
  • 1
    I gthink it would be better if the creator of String Class/methods had created 2 separate/overloaded methods to distinguish the same, which wont have affected the performance. – 53by97 May 27 '14 at 09:11
  • 9
    I think it would be better if you learned to consider performance only when it is a (measurable) problem in an application. If high-performance ASCII string manipulation were critical for an application, one would not choose java.lang.String in the first place. Please learn about [premature optimization](http://stackoverflow.com/questions/385506/when-is-optimisation-premature). Besides that, a smaller API is always better design unless you have very good reasons not to do so. – ignis May 27 '14 at 15:49
  • 2
    @ignis, The OP is coming here after 1) Receiving an interview question, 2) Answering it, and 3) Going home and researching the question more thoroughly. He's not trying to prematurely optimize some production code, he's trying to learn _why_ the library is the way it is. – Brian S May 27 '14 at 22:28
  • 1
    @BrianS This would still be premature optimization because he hasn't measured and verified that the existing implementation of equalsIgnoreCase is a significant performance bottleneck in an actual application. – Evicatos May 28 '14 at 00:24
  • 2
    @53by97 Yes, this is a very common criticism of Java's String class. `String` should probably have been an abstract class with 2 subclasses (ASCII and UTF-8). Right now, it is a `final` UTF-16 class that cannot be extended. Oh well, it does not matter for most applications. – Navin May 28 '14 at 01:32
  • 1
    @Evicatos, He's not trying to optimize, he's trying to learn. And that's exactly the sort of thing we should support here. He may have begun by heading in the wrong direction (the library implementation is "bad" because it's slower), but he's here for the right reason. – Brian S May 28 '14 at 04:09
  • @BrianS This question is based on the false assumption that there must be a justification for String.equalsIgnoreCase() not to have light-speed performance. This question is not only about the actual implementation, or ASCII vs UTF-16. – ignis May 28 '14 at 05:41
2

I think this exerpt from String.java is relevant:

if (ignoreCase) {
    // If characters don't match but case may be ignored,
    // try converting both characters to uppercase.
    // If the results match, then the comparison scan should
    // continue.
    char u1 = Character.toUpperCase(c1);
    char u2 = Character.toUpperCase(c2);
    if (u1 == u2) {
        continue;
    }
    // Unfortunately, conversion to uppercase does not work properly
    // for the Georgian alphabet, which has strange rules about case
    // conversion.  So we need to make one last check before
    // exiting.
    if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
        continue;
    }
}
zbstof
  • 1,052
  • 3
  • 14
  • 27