I'm looking for a method passing the following test cases:

 assertEquals(0, indexOfIgnoreCase("ss", "ß"));
 assertEquals(0, indexOfIgnoreCase("ß", "ss"));
 assertEquals(1, indexOfIgnoreCase("ßa", "a"));

The funny character (the German "sharp s") is not really exotic (U+00DF, in the Latin-1 Supplement Unicode block), unless you capitalize it: "ß".toUpperCase() returns "SS" (independent of the locale).

My search for a solution that works for at least the first 256 Unicode characters returned nothing but ICU4J, which I don't want to use.

This question (indirectly) asks for a case-insensitive version of String.contains, but note that most of the answers work for ASCII only. The accepted answer can be adapted as follows

final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
Pattern pattern = Pattern.compile(Pattern.quote(needle), flags);
final Matcher matcher = pattern.matcher(hay);
return matcher.find() ? matcher.start() : -1;

so that it also works for non-ASCII characters and returns the position instead of a boolean. However, it still fails the first two tests above.
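To see which cases the adapted snippet handles, it can be wrapped into a small runnable class. The class name `RegexIndexOf` and the sample strings are made up for this sketch; the method body is the snippet from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the method body comes from the question.
class RegexIndexOf {
    static int indexOfIgnoreCase(String hay, String needle) {
        final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        Pattern pattern = Pattern.compile(Pattern.quote(needle), flags);
        final Matcher matcher = pattern.matcher(hay);
        return matcher.find() ? matcher.start() : -1;
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s; "\u00C4" and "\u00E4" are upper- and lower-case a-umlaut.
        System.out.println(indexOfIgnoreCase("x\u00C4", "\u00E4")); // 1: non-ASCII one-to-one case pairs match
        System.out.println(indexOfIgnoreCase("\u00DFa", "A"));      // 1: the sharp s is simply skipped over
        System.out.println(indexOfIgnoreCase("ss", "\u00DF"));      // -1: no one-to-many mapping is applied
        System.out.println(indexOfIgnoreCase("\u00DF", "ss"));      // -1
    }
}
```

The matcher compares character by character, so a needle of length one can never match two characters of the haystack, which is what the first two test cases require.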

Apache org.apache.commons.lang3.StringUtils doesn't pass either. This nice answer utilizing String.regionMatches provides a fast solution, but doesn't pass.

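Converting both strings to lowercase wouldn't suffice, since "ß" stays "ß" and never equals "ss". Converting to uppercase almost works, but the last test case would return 2 instead of 1, because "ßa".toUpperCase() is "SSA" and the index then refers to the expanded string.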
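A minimal sketch of both naive conversions, to make the index shift concrete. The class and method names (`NaiveCaseSearch`, `lowerIndexOf`, `upperIndexOf`) are hypothetical and only serve the illustration.

```java
import java.util.Locale;

// Hypothetical demo class, not a proposed solution.
class NaiveCaseSearch {
    // Search after lower-casing both strings.
    static int lowerIndexOf(String hay, String needle) {
        return hay.toLowerCase(Locale.ROOT).indexOf(needle.toLowerCase(Locale.ROOT));
    }

    // Search after upper-casing both strings; the result indexes the upper-cased string.
    static int upperIndexOf(String hay, String needle) {
        return hay.toUpperCase(Locale.ROOT).indexOf(needle.toUpperCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s.
        System.out.println(lowerIndexOf("ss", "\u00DF"));  // -1: lower-casing leaves the sharp s alone
        System.out.println(upperIndexOf("ss", "\u00DF"));  // 0
        System.out.println(upperIndexOf("\u00DF", "ss"));  // 0
        System.out.println(upperIndexOf("\u00DFa", "a"));  // 2: index into "SSA", not into the original
    }
}
```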


I'm a bit unsure about what the result of

indexOfIgnoreCase("ßa", "sa")

should be. Perhaps 0.5, as the "needle" starts at the second S from the capitalization of ß?
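One possible way to get original positions back is to upper-case the haystack one code point at a time and remember which original index each expanded character came from. The following is only a sketch under assumptions: the class name `MappedIndexOf` is invented, upper-casing with `Locale.ROOT` is used as a stand-in for a proper case-insensitive comparison, and a match that starts inside an expanded character is arbitrarily rounded down to the index of that character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: upper-case search with a map back to original indices.
class MappedIndexOf {
    static int indexOfIgnoreCase(String hay, String needle) {
        StringBuilder expanded = new StringBuilder();
        List<Integer> origin = new ArrayList<>(); // origin.get(k) = index in hay that produced expanded char k
        for (int i = 0; i < hay.length(); ) {
            int cp = hay.codePointAt(i);
            String upper = new String(Character.toChars(cp)).toUpperCase(Locale.ROOT);
            for (int k = 0; k < upper.length(); k++) {
                origin.add(i);
            }
            expanded.append(upper);
            i += Character.charCount(cp);
        }
        int pos = expanded.indexOf(needle.toUpperCase(Locale.ROOT));
        if (pos < 0) {
            return -1;
        }
        // An empty needle can match at the very end, past the last mapped char.
        return pos < origin.size() ? origin.get(pos) : hay.length();
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s.
        System.out.println(indexOfIgnoreCase("ss", "\u00DF"));  // 0
        System.out.println(indexOfIgnoreCase("\u00DF", "ss"));  // 0
        System.out.println(indexOfIgnoreCase("\u00DFa", "a"));  // 1
        System.out.println(indexOfIgnoreCase("\u00DFa", "sa")); // 0: the "0.5" case, rounded down
    }
}
```

Whether a match that begins or ends in the middle of an expanded character should count at all is exactly the open question above; this sketch accepts such matches, and rejecting them would need an extra check on the match boundaries.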

maaartinus

1 Answer

  1. Convert original text and needle to char arrays
  2. Convert each character to upper case
  3. Find needle sub-array position in original text array.

For example:

char[] text = convertToUpperCase("...".toCharArray());
char[] needle = convertToUpperCase("...".toCharArray());

// "<=" so that a match at the very end of the text is also found
for (int i = 0; i <= text.length - needle.length; i++)
    if (arraysEqual(needle, 0, text, i, needle.length)) // The same parameter order as System.arraycopy
        return i;

return -1;
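The steps above can be turned into a runnable sketch to check them against the question's test cases. The answer does not show the bodies of `convertToUpperCase` and `arraysEqual`, so the versions below are assumptions: the conversion is taken to be `Character.toUpperCase` applied to each char, and the class name `PerCharUpperCase` is invented.

```java
// Hypothetical completion of the answer's outline; helper bodies are assumed.
class PerCharUpperCase {
    // Upper-cases each char on its own, so the array length never changes.
    static char[] convertToUpperCase(char[] chars) {
        char[] result = new char[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = Character.toUpperCase(chars[i]);
        }
        return result;
    }

    // Compares length chars of a (from aPos) with b (from bPos).
    static boolean arraysEqual(char[] a, int aPos, char[] b, int bPos, int length) {
        for (int k = 0; k < length; k++) {
            if (a[aPos + k] != b[bPos + k]) {
                return false;
            }
        }
        return true;
    }

    static int indexOfIgnoreCase(String hay, String needleString) {
        char[] text = convertToUpperCase(hay.toCharArray());
        char[] needle = convertToUpperCase(needleString.toCharArray());
        // "<=" so that a match at the very end of the text is also found
        for (int i = 0; i <= text.length - needle.length; i++) {
            if (arraysEqual(needle, 0, text, i, needle.length)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s.
        System.out.println(indexOfIgnoreCase("Hello World", "WORLD")); // 6
        System.out.println(indexOfIgnoreCase("\u00DFa", "A"));         // 1
        System.out.println(indexOfIgnoreCase("ss", "\u00DF"));         // -1
        System.out.println(indexOfIgnoreCase("\u00DF", "ss"));         // -1
    }
}
```

Under this reading, a char-to-char conversion cannot expand ß into two characters, so the first two test cases from the question still fail.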
ursa
  • I would like to point out that as far as Unicode goes, this is actually incorrect. Case-insensitive matching in Unicode must be done by comparing the case-folded version of the strings (as opposed to uppercase, lowercase, and titlecase version). – Wiz Jun 29 '15 at 03:12
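To illustrate what the comment means by comparing folded strings: the sketch below approximates a fold by upper-casing and then lower-casing each code point on its own. This is an assumption-laden stand-in, not the Unicode case folding mapping itself, and the names `FoldSketch` and `approxFold` are invented for the example.

```java
import java.util.Locale;

// Hypothetical approximation of case folding; not the Unicode CaseFolding data.
class FoldSketch {
    static String approxFold(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            // Upper-case one code point; this may expand, e.g. sharp s becomes "SS".
            String upper = new String(Character.toChars(cp)).toUpperCase(Locale.ROOT);
            // Lower-case each resulting code point separately.
            upper.codePoints().forEach(u ->
                out.append(new String(Character.toChars(u)).toLowerCase(Locale.ROOT)));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u00DF" is the sharp s.
        System.out.println(approxFold("\u00DF"));                                        // ss
        System.out.println(approxFold("Stra\u00DFe").equals(approxFold("STRASSE")));     // true
    }
}
```

Two strings would then be treated as equal ignoring case when their folded forms are equal; for substring search, positions in the folded string still have to be mapped back to the original.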