
I have been using Lucene's ASCII folding filter to handle diacritics, not just for the documents in Elasticsearch but for various other kinds of strings as well.

import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
// Strings.isNullOrEmpty is e.g. Guava's com.google.common.base.Strings (or an equivalent helper).

public static String normalizeText(String text, boolean shouldTrim, boolean shouldLowerCase) {
    if (Strings.isNullOrEmpty(text)) {
        return text;
    }
    if (shouldTrim) {
        text = text.trim();
    }
    if (shouldLowerCase) {
        text = text.toLowerCase();
    }
    char[] charArray = text.toCharArray();

    // Once folded, a single character can expand to several characters (e.g. "Æ" -> "AE"),
    // and the Javadoc says the output buffer must be of size >= length * 4.
    char[] out = new char[charArray.length * 4 + 1];
    int outLength = ASCIIFoldingFilter.foldToASCII(charArray, 0, out, 0, charArray.length);
    return String.copyValueOf(out, 0, outLength);
}
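
For example, a call like the following (just to illustrate what the method above does) prints the trimmed, lower-cased, folded string:

String folded = normalizeText("  Caffè Crème  ", true, true);
System.out.println(folded);  // prints "caffe creme"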

However, the official documentation marks this method with the note "This API is for internal purposes only and might change in incompatible ways in the next release." The alternative is the non-static foldToASCII(char[] input, int length) method (which internally calls the same static method), but using it requires setting up an ASCII folding filter, a token filter, a token stream, and an analyzer (which in turn means choosing the kind of analyzer, and I might have to create a custom one). I couldn't find examples where developers have done the latter. I tried writing some solutions of my own, but the non-static foldToASCII doesn't return the exact output; it attaches a list of unwanted characters at the end. I am wondering how various developers have dealt with this?

EDIT: I also see that some open-source projects use the static foldToASCII, so another question is whether it is really worth using the non-static foldToASCII at all.

A_G
    [Here](https://stackoverflow.com/questions/59723144/using-lucene-analyzer-without-indexing-is-my-approach-reasonable) is how I have dealt with this - it's what you mention above: _Use an analyzer and a token stream._ It's really not much code - and has been working fine for me. One thing about Lucene's ASCII folding is that it's a much broader interpretation than you get via [normalization](https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html), so it can be useful because of that. – andrewJames Aug 25 '20 at 19:24
  • Hey, @andrewjames thanks for the concise solution! I used a similar outline – A_G Aug 26 '20 at 03:37
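
To illustrate the difference andrewJames points out in the comments: java.text.Normalizer-based stripping only removes combining marks after decomposition, whereas Lucene's folding also maps characters such as Æ and ø that have no decomposition. A rough comparison (the sample string is arbitrary, and the static folding helper is used here purely for illustration):

import java.text.Normalizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public class FoldingVsNormalizer {
    public static void main(String[] args) {
        String text = "Æon Caffè søster";

        // Unicode normalization: decompose, then strip combining marks.
        String stripped = Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
        System.out.println(stripped);   // "Æon Caffe søster" (Æ and ø are left untouched)

        // Lucene's ASCII folding also maps characters that have no decomposition.
        char[] in = text.toCharArray();
        char[] out = new char[in.length * 4 + 1];
        int len = ASCIIFoldingFilter.foldToASCII(in, 0, out, 0, in.length);
        System.out.println(String.copyValueOf(out, 0, len));   // "AEon Caffe soster"
    }
}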

1 Answer


Based on the comment by @andrewJames, below is the closest I was able to come up with without using the static method. KeywordTokenizer emits the entire input as a single token, so there is no need to loop through tokens.

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordTokenizerFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

String text = "Caffè";
String output = "";

try (Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer(KeywordTokenizerFactory.class)
        .addTokenFilter(ASCIIFoldingFilterFactory.class)
        .build()) {
    try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
        CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        // KeywordTokenizer produces exactly one token, so a single incrementToken() is enough.
        if (ts.incrementToken()) {
            output = charTermAtt.toString();
        }
        ts.end();
    }
} catch (IOException e) {
    // Folding an in-memory string should not normally throw; log or rethrow as appropriate.
}

System.out.println(output);  // prints "Caffe"
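
If you want to avoid even the CustomAnalyzer, the same pipeline can be wired up by hand with a KeywordTokenizer feeding an ASCIIFoldingFilter. A rough sketch (the helper name foldToAsciiViaFilter is made up for illustration):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public static String foldToAsciiViaFilter(String text) throws IOException {
    Tokenizer tokenizer = new KeywordTokenizer();   // whole input becomes a single token
    tokenizer.setReader(new StringReader(text));
    try (TokenStream ts = new ASCIIFoldingFilter(tokenizer)) {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        String folded = ts.incrementToken() ? termAtt.toString() : text;
        ts.end();
        return folded;
    }
}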
Peter