
How can I convert characters in Java from Extended ASCII or Unicode to their 7-bit ASCII equivalents, including special characters such as the open (“ 0x93) and close (” 0x94) quotes to a plain double quote (" 0x22), or the dash (– 0x96) to a hyphen-minus (- 0x2D)? I have found Stack Overflow questions similar to this one, but the answers only seem to deal with accents and ignore special characters.

For example, I would like “Caffè – Peña” to be transformed to "Caffe - Pena".

However, when I use java.text.Normalizer:

String sample = "“Caffè – Peña”";
System.out.println(Normalizer.normalize(sample, Normalizer.Form.NFD)
                         .replaceAll("\\p{InCombiningDiacriticalMarks}", ""));

Output is

“Caffe – Pena”

To clarify my need, I am interacting with an IBM i Db2 database that uses EBCDIC encoding. If a user pastes a string copied from Word or Outlook for example, characters like the ones I specified are translated to SUB (0x3F in EBCDIC, 0x1A in ASCII). This causes a lot of unnecessary headache. I am looking for a way to sanitize the string so as little information as possible is lost.

John Y
Peter
  • See sister site: Software Recommendations Stack Exchange. – Basil Bourque Mar 30 '22 at 16:31
  • Just use String.replace. https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(char,%20char) – passer-by Mar 30 '22 at 16:47
  • This is actually quite subjective so I would probably advise building your own conversion map that suits *you* – g00se Mar 30 '22 at 17:19
  • @BasilBourque there are tens if not hundreds of questions just like mine specifically asking for the same thing, sans the special characters; why close mine? I edited the question to remove the concerning word. Hopefully this will suffice. – Peter Mar 30 '22 at 19:08
  • There is no universal method that can replace dashes or smart quotes, either in Java or in the Unicode specification. You will have to do it yourself with your own mappings. That said, you can probably use `s = s.replaceAll("\\p{Pd}", "-")` and `s = s.replaceAll("[\\p{Pi}\\p{Pf}]", "\"")` to make it shorter. See http://unicode.org/reports/tr44/#General_Category_Values. – VGR Mar 30 '22 at 19:26
  • I voted to close your Question because asking for library recommendations is explicitly off-topic here. Such questions tend to devolve into unproductive arguments. The sister site I recommended is designed to avoid that problem. You’ve reworded to avoid explicitly asking for a library. But given that you yourself have said there are tens if not hundreds of duplicate questions, then there is no point in reopening this one. – Basil Bourque Mar 30 '22 at 20:20
  • @BasilBourque yes, there are tens if not hundreds of similar questions, but not one of them deals with special characters. They all deal with accent marks. – Peter Mar 30 '22 at 20:31
  • You’ve not yet defined "special characters". – Basil Bourque Mar 30 '22 at 23:17
  • @BasilBourque this must be a big misunderstanding. Under special characters I meant the very ones I listed as examples: opening and closing quotes and dash. As I said multiple times and listed in my examples, the existing questions only deal with removing accents like è -> e. But they will not convert “ to ". – Peter Mar 30 '22 at 23:42
  • @Peter: So are those three the *only* characters you consider to be "special"? While the question is under-specified, no-one will be able to help you... (To be clear: it's *incredibly* frustrating to provide an answer that does everything in the question, only to be told, "Oh but there's also this case that your answer doesn't cover." And it feels like that's very, very likely to happen here with such a vague description as "special characters".) – Jon Skeet Apr 01 '22 at 13:12
  • @JonSkeet, those are the three I came across that cause issues. I am looking for general solution that would handle any future cases. Maybe me choosing the name "special characters" was not great. I wanted to differentiate my question from all the others which are only concerned with accent marks or alphabet characters. I am sure that there are more characters like the ones I have provided. For example ellipsis, bottom quotes, open and closed single quote, etc. – Peter Apr 01 '22 at 13:44
  • Right. So as I suspected, a solution which covered those three *wouldn't* be acceptable. You need to come up with a definition for exactly which characters you'd expect to be covered. – Jon Skeet Apr 01 '22 at 13:47

3 Answers


You can just use String.replace() to replace the quote characters as another commenter recommends, and you could grow the list of problematic characters over time.

You could also use a more generic function to replace or ignore any characters that can't be encoded. For instance:

    private String removeUnrepresentableChars(final String _str, final String _encoding) throws CharacterCodingException, UnsupportedEncodingException {
        final CharsetEncoder encoder = Charset.forName(_encoding).newEncoder();
        encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        final ByteBuffer encoded = encoder.encode(CharBuffer.wrap(_str));
        // Read only up to limit(): array() may contain unused trailing capacity
        return new String(encoded.array(), 0, encoded.limit(), _encoding);
    }

    private String replaceUnrepresentableChars(final String _str, final String _encoding, final String _replacement) throws CharacterCodingException, UnsupportedEncodingException {
        final CharsetEncoder encoder = Charset.forName(_encoding).newEncoder();
        encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        encoder.replaceWith(_replacement.getBytes(_encoding));
        final ByteBuffer encoded = encoder.encode(CharBuffer.wrap(_str));
        return new String(encoded.array(), 0, encoded.limit(), _encoding);
    }

So you could call those with an _encoding of "IBM-037", for instance.
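For illustration, here is a self-contained sketch of the replacement variant (the class name `EncodingDemo` is made up for the demo; ISO-8859-1 is used because it is guaranteed to be available in every JVM, while "IBM-037" requires the extended EBCDIC charsets to be installed):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class EncodingDemo {
    public static String replaceUnrepresentableChars(String str, String encoding, String replacement)
            throws CharacterCodingException {
        Charset charset = Charset.forName(encoding);
        CharsetEncoder encoder = charset.newEncoder();
        encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        encoder.replaceWith(replacement.getBytes(charset));
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(str));
        // Read only up to limit(): array() may contain unused trailing capacity
        return new String(encoded.array(), 0, encoded.limit(), charset);
    }

    public static void main(String[] args) throws CharacterCodingException {
        // U+201C/U+201D are unmappable in Latin-1; è (U+00E8) is representable
        System.out.println(replaceUnrepresentableChars("\u201CCaff\u00E8\u201D", "ISO-8859-1", "?"));
        // prints: ?Caffè?
    }
}
```

Note that the smart quotes become "?" while the accented è survives, since Latin-1 can represent it.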

However, if your objective is to lose as little information as possible, you should evaluate whether the data can be stored in UTF-8 (CCSID 1208). That would handle the smart quotes and other "special characters" just fine. Depending on your database and application structure, such a change could be very small to implement, or it could be very large and risky! But the only way to get lossless translation is to use a Unicode encoding, and UTF-8 is the most sensible choice.

Jesse

The commenters who have said your problem is "subjective" (not in the sense of opinion-based but in the sense of each person's specific requirements being slightly different from everyone else's) or poorly defined or inherently impossible... are technically correct.

But you are looking for something practical you can do to improve the situation, which is also completely valid.

The sweet spot in terms of balancing difficulty of implementation with accuracy of results is to stitch together what you've already found plus the suggestions from the less-negative commenters:

  • Handle the diacriticals and other "standardly normalizable" characters with standard normalization procedures.
  • Handle everything else with your own mapping (which may include the Unicode General_Category property, but ultimately might need to include your own hand-picked replacement of specific characters with other specific characters).

The above might cover "all" future cases, depending on where the data is coming from. Or close enough to all that you can implement it and be done with it. If you want to add some robustness, and will be around to maintain this process for a while, then you could also come up with a list of all the characters you want to allow in the sanitized result, and then set up some kind of exception or logging mechanism that will let you (or your successor) find new unhandled cases as they arise that can then be used to refine the custom part of the mapping.
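A minimal sketch of that two-step approach, assuming a hand-picked starter map (the class name, the map contents, and the stderr logging are illustrative choices, not a fixed recipe):

```java
import java.text.Normalizer;
import java.util.Map;

public class Sanitizer {
    // Hand-picked replacements; extend this map as logging reveals new cases
    private static final Map<Character, String> CUSTOM = Map.of(
            '\u201C', "\"", '\u201D', "\"",   // smart double quotes
            '\u2018', "'", '\u2019', "'",     // smart single quotes
            '\u2013', "-", '\u2014', "-",     // en dash and em dash
            '\u2026', "...");                 // ellipsis

    public static String sanitize(String input) {
        // Step 1: standard normalization strips the diacriticals
        String s = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Step 2: custom mapping for everything else; log anything unhandled
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c < 0x80) {
                out.append(c);
            } else if (CUSTOM.containsKey(c)) {
                out.append(CUSTOM.get(c));
            } else {
                System.err.printf("Unhandled character: U+%04X%n", (int) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("\u201CCaff\u00E8 \u2013 Pe\u00F1a\u201D"));
        // prints: "Caffe - Pena"
    }
}
```

Here unhandled characters are silently dropped after logging; throwing an exception instead would make new cases impossible to miss.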

John Y

After some digging I was able to find a solution based on this answer, using org.apache.lucene.analysis.ASCIIFoldingFilter.

All the examples I was able to find were using the static version of the method foldToASCII, as in this project:

private static String getFoldedString(String text) {
    char[] textChar = text.toCharArray();
    char[] output = new char[textChar.length * 4];
    int outputPos = ASCIIFoldingFilter.foldToASCII(textChar, 0, output, 0, textChar.length);
    text = new String(output, 0, outputPos);
    return text;
}

However that static method has a note on it saying

This API is for internal purposes only and might change in incompatible ways in the next release.

So after some trial and error I came up with this version that avoids using the static method:

public static String getFoldedString(String text) throws IOException {
    String output = "";
    try (Analyzer analyzer = CustomAnalyzer.builder()
              .withTokenizer(KeywordTokenizerFactory.class)
              .addTokenFilter(ASCIIFoldingFilterFactory.class)
              .build()) {
        try (TokenStream ts = analyzer.tokenStream(null, new StringReader(text))) {
            CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            if (ts.incrementToken()) output = charTermAtt.toString();
            ts.end();
        }
    }
    return output;
}

Similar to an answer I provided here.

This does exactly what I was looking for and translates characters to their ASCII 7-bit equivalent version.

However, through further research I have found that because I am mostly dealing with Windows-1252 encoding, and because of the way jt400 handles ASCII <-> EBCDIC (CCSID 37) translation, if an ASCII string is translated to EBCDIC and back to ASCII, the only characters that are lost are 0x80 through 0x9F. So, inspired by the way Lucene's foldToASCII handles it, I put together the following method that handles these cases only:

public static String replaceInvalidChars(String text) {
    char[] input = text.toCharArray();
    int length = input.length;
    char[] output = new char[length * 6];
    int outputPos = 0;
    for (int pos = 0; pos < length; pos++) {
        final char c = input[pos];
        if (c < '\u0080') {
            output[outputPos++] = c;
        } else {
            switch (c) {
                case '\u20ac':  //€ 0x80
                    output[outputPos++] = 'E';
                    output[outputPos++] = 'U';
                    output[outputPos++] = 'R';
                    break;
                case '\u201a':  //‚ 0x82
                    output[outputPos++] = '\'';
                    break;
                case '\u0192':  //ƒ 0x83
                    output[outputPos++] = 'f';
                    break;
                case '\u201e':  //„ 0x84
                    output[outputPos++] = '"';
                    break;
                case '\u2026':  //… 0x85
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    output[outputPos++] = '.';
                    break;
                case '\u2020':  //† 0x86
                    output[outputPos++] = '?';
                    break;
                case '\u2021':  //‡ 0x87
                    output[outputPos++] = '?';
                    break;
                case '\u02c6':  //ˆ 0x88
                    output[outputPos++] = '^';
                    break;
                case '\u2030':  //‰ 0x89
                    output[outputPos++] = 'p';
                    output[outputPos++] = 'e';
                    output[outputPos++] = 'r';
                    output[outputPos++] = 'm';
                    output[outputPos++] = 'i';
                    output[outputPos++] = 'l';
                    break;
                case '\u0160':  //Š 0x8a
                    output[outputPos++] = 'S';
                    break;
                case '\u2039':  //‹ 0x8b
                    output[outputPos++] = '\'';
                    break;
                case '\u0152':  //Œ 0x8c
                    output[outputPos++] = 'O';
                    output[outputPos++] = 'E';
                    break;
                case '\u017d':  //Ž 0x8e
                    output[outputPos++] = 'Z';
                    break;
                case '\u2018':  //‘ 0x91
                    output[outputPos++] = '\'';
                    break;
                case '\u2019':  //’ 0x92
                    output[outputPos++] = '\'';
                    break;
                case '\u201c':  //“ 0x93
                    output[outputPos++] = '"';
                    break;
                case '\u201d':  //” 0x94
                    output[outputPos++] = '"';
                    break;
                case '\u2022':  //• 0x95
                    output[outputPos++] = '-';
                    break;
                case '\u2013':  //– 0x96
                    output[outputPos++] = '-';
                    break;
                case '\u2014':  //— 0x97
                    output[outputPos++] = '-';
                    break;
                case '\u02dc':  //˜ 0x98
                    output[outputPos++] = '~';
                    break;
                case '\u2122':  //™ 0x99
                    output[outputPos++] = '(';
                    output[outputPos++] = 'T';
                    output[outputPos++] = 'M';
                    output[outputPos++] = ')';
                    break;
                case '\u0161':  //š 0x9a
                    output[outputPos++] = 's';
                    break;
                case '\u203a':  //› 0x9b
                    output[outputPos++] = '\'';
                    break;
                case '\u0153':  //œ 0x9c
                    output[outputPos++] = 'o';
                    output[outputPos++] = 'e';
                    break;
                case '\u017e':  //ž 0x9e
                    output[outputPos++] = 'z';
                    break;
                case '\u0178':  //Ÿ 0x9f
                    output[outputPos++] = 'Y';
                    break;
                default:
                    output[outputPos++] = c;
                    break;
            }
        }
    }
    
    return new String(output, 0, outputPos);
}

Since it turns out that my real problem was Windows-1252 to Latin-1 (ISO-8859-1) translation, here is supporting material showing the Windows-1252 to Unicode mapping used in the method above to ultimately produce Latin-1.
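As a quick cross-check of the code-point comments in the switch above, the JDK's own windows-1252 decoder can dump the 0x80–0x9F range directly (a sketch; note that 0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in Windows-1252 and decode to the corresponding C1 controls):

```java
import java.nio.charset.Charset;

public class Win1252Table {
    public static void main(String[] args) {
        Charset w1252 = Charset.forName("windows-1252");
        // Decode each byte in the 0x80-0x9F range and show its Unicode code point
        for (int b = 0x80; b <= 0x9F; b++) {
            String s = new String(new byte[] { (byte) b }, w1252);
            System.out.printf("0x%02X -> U+%04X%n", b, (int) s.charAt(0));
        }
    }
}
```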

Peter