
I need to create a Collator that corresponds to https://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive/, i.e. one that ignores case differences between the ASCII characters A-Z and a-z when making comparisons.

I have attempted this with the following ICU4J RuleBasedCollator:

final RuleBasedCollator collator =
        new RuleBasedCollator("&a=A, b=B, c=C, d=D, e=E, f=F, g=G, h=H, "
                + "i=I, j=J, k=K, l=L, m=M, n=N, o=O, p=P, q=Q, r=R, s=S, t=T, "
                + "u=U, v=V, u=U, v=V, w=W, x=X, y=Y, z=Z").freeze();

However, the following comparison seems to fail, where I would expect it to succeed (i.e. return true):

final SearchIterator searchIterator = new StringSearch(
        "pu", new StringCharacterIterator("iNPut"), collator);
return searchIterator.first() >= 0;

What am I missing in my rules?

adamretter

2 Answers

  1. This W3C "collation" does not look like a Collator in the usual sense. It's an ASCII-case-insensitive matcher without ordering. I suspect that it is usually implemented with low-level code that matches ASCII letters case-insensitively and everything else precisely. See https://www.w3.org/TR/xpath-functions-31/#html-ascii-case-insensitive-collation

  2. The Collator rules probably don't do what you think they do. The comma is old syntax for a tertiary difference, so &a=A, b=B, c=C is the same as &a=A<<<b=B<<<c=C. I think you were intending something like &a=A &b=B &c=C etc.
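
The low-level matcher mentioned in point 1 could be sketched roughly as follows, using only the Java standard library rather than ICU4J. The class name `AsciiCaseInsensitive` and its methods are hypothetical names chosen for illustration; the sketch assumes that folding only A-Z to a-z and comparing everything else exactly, one UTF-16 code unit at a time, is what the W3C collation requires.

```java
// Hypothetical helper, not part of ICU4J or of any W3C reference implementation.
final class AsciiCaseInsensitive {

    private AsciiCaseInsensitive() {
    }

    // Folds only ASCII 'A'..'Z' to 'a'..'z'; every other char is returned unchanged.
    public static char foldAscii(char c) {
        return (c >= 'A' && c <= 'Z') ? (char) (c + ('a' - 'A')) : c;
    }

    // True if needle occurs in haystack at the given offset, ignoring ASCII case only.
    // The caller must ensure that offset + needle.length() <= haystack.length().
    public static boolean matchesAt(String haystack, int offset, String needle) {
        for (int i = 0; i < needle.length(); i++) {
            if (foldAscii(haystack.charAt(offset + i)) != foldAscii(needle.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Returns the index of the first ASCII-case-insensitive match, or -1 if there is none.
    public static int indexOf(String haystack, String needle) {
        for (int start = 0; start + needle.length() <= haystack.length(); start++) {
            if (matchesAt(haystack, start, needle)) {
                return start;
            }
        }
        return -1;
    }

    // Convenience wrapper around indexOf.
    public static boolean contains(String haystack, String needle) {
        return indexOf(haystack, needle) >= 0;
    }
}
```

With this sketch, `AsciiCaseInsensitive.contains("iNPut", "pu")` returns true, while non-ASCII letters that differ only by case (for example `"\u00C9"` and `"\u00E9"`) are treated as different.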

  • Okay, that makes sense, thanks. However, I am still having problems writing a string contains method using `SearchIterator`. I took the code from my question and changed the collation rules to: `&a=A &b=B &c=C &d=D &e=E &f=F &g=G &h=H &i=I &j=J &k=K &l=L &m=M &n=N &o=O &p=P &q=Q &r=R &s=S &t=T &u=U &v=V &w=W &x=X &y=Y &z=Z` but `searchIterator.first()` still returns `-1`. – adamretter Nov 16 '17 at 22:02

com.ibm.icu.text.RuleBasedCollator#compare

Returns an integer value. Value is less than zero if source is less than target, value is zero if source and target are equal, value is greater than zero if source is greater than target

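import com.ibm.icu.text.Collator;
import com.ibm.icu.text.RuleBasedCollator;
import java.util.Locale;

String a = "Pu";
String b = "pu";

// c1: default en-US collator; c2: custom rule making 'p' and 'P' equal
// (the RuleBasedCollator constructor may throw if the rules are invalid)
RuleBasedCollator c1 = (RuleBasedCollator) Collator.getInstance(new Locale("en", "US", ""));
RuleBasedCollator c2 = new RuleBasedCollator("& p=P");
System.out.println(c1.compare(a, b) == 0);
System.out.println(c2.compare(a, b) == 0);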

Output
======
false
true

It appears that the rules are not where the problem lies; something seems to be wrong with the SearchIterator code.


If you don't have to use the SearchIterator then perhaps you could write your own 'contains' method. Maybe something like this:

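boolean contains(String a, String b, RuleBasedCollator c) {
  // Slide a window of b.length() chars over a and compare each window to b
  // with the collator. This assumes a match has the same length as b, which
  // holds for rules that only equate upper- and lower-case letters.
  for (int start = 0; start + b.length() <= a.length(); start++) {
    if (c.compare(a.substring(start, start + b.length()), b) == 0) {
      return true;
    }
  }
  return false;
}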

Perhaps not the best code in the world, but you get the idea.

ParallelNoob
  • Hmm, that is interesting. I wonder if the rules are asymmetrical? E.g., to compare in both directions, would I need to define `"& p=P, P=p"`? – adamretter Nov 13 '17 at 23:42
  • The equal sign works both ways so the rule should as well, yes? – ParallelNoob Nov 18 '17 at 08:05
  • From the [ICU collator customization user guide](http://userguide.icu-project.org/collation/customization): x=y, Signifies no difference between "x" and "y". – ParallelNoob Nov 18 '17 at 08:11