0

I have a Java String object that contains a word like "resumè", or any other word with international characters in it. I want to encode the non-ASCII characters as escapes in an ASCII string, like "resum\u00E8". How do I do this in Java?

Raedwald
Swapnonil Mukherjee
  • Please take a look at [this](http://stackoverflow.com/questions/285228/how-to-convert-utf-8-to-us-ascii-in-java) SO post. I am not sure if it solves your issue. – npinti Jun 23 '15 at 10:31
  • See also http://stackoverflow.com/questions/1453171/%c5%84-%c7%b9-%c5%88-%c3%b1-%e1%b9%85-%c5%86-%e1%b9%87-%e1%b9%8b-%e1%b9%89-%cc%88-%c9%b2-%c6%9e-%e1%b6%87-%c9%b3-%c8%b5-n-or-remove-diacritical-marks-from-unicode-cha – Raedwald Jun 23 '15 at 12:19

6 Answers

1

Here's a simple implementation (based on the private java.util.Properties.saveConvert method):

private static final char[] hexDigit = { '0', '1', '2', '3', '4', '5', '6',
        '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };

public static String escapeUnicode(String str) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        char aChar = str.charAt(i);
        if ((aChar < 0x0020) || (aChar > 0x007e)) {
            sb.append('\\');
            sb.append('u');
            sb.append(hexDigit[((aChar >> 12) & 0xF)]);
            sb.append(hexDigit[((aChar >> 8) & 0xF)]);
            sb.append(hexDigit[((aChar >> 4) & 0xF)]);
            sb.append(hexDigit[(aChar & 0xF)]);
        } else {
            sb.append(aChar);
        }
    }
    return sb.toString();
}
Tagir Valeev
  • outputs \uFFFD and not \u00E8. – Swapnonil Mukherjee Jun 23 '15 at 10:54
  • @SwapnonilMukherjee: make sure you are reading the input file in the correct encoding (or, if you are testing against a constant in the source file, that javac uses the same character set as the source file). – Tagir Valeev Jun 23 '15 at 11:09
  • Tagir, I am not using any file. I am just running the code like this: `public static void main(String[] args) throws UnsupportedEncodingException { String data = "è"; System.out.println(escapeUnicode(data)); }` and it prints \uFFFD – Swapnonil Mukherjee Jun 23 '15 at 11:16
  • I am using IntelliJ Community Edition, Windows 64 Bit and Java 7. Is the platform or machine encoding causing this issue? I don't know. Can you tell me what environment you are using? – Swapnonil Mukherjee Jun 23 '15 at 11:19
  • 1
    Works Now. I had to change the character encoding settings in IntelliJ. It was picking up the platform default Windows CP 1252. – Swapnonil Mukherjee Jun 23 '15 at 11:31
1

You can find the Unicode value of a char using the utility method below:

private static String findUnicodeValue(char ch) {
    return "\\u" + Integer.toHexString(ch | 0x10000).substring(1);
}

You can then replace each non-ASCII char in the string with its Unicode escape.
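
The replacement step is not spelled out in this answer, so here is a minimal sketch of one way it could look. The class name `UnicodeReplaceSketch` and the method `escapeNonAscii` are hypothetical names introduced for illustration; the range check (below 0x20 or above 0x7e) is borrowed from the other answers on this page.

```java
class UnicodeReplaceSketch {

    // Same helper as in the answer: OR-ing with 0x10000 forces five hex
    // digits, and substring(1) drops the leading 1, leaving four digits.
    private static String findUnicodeValue(char ch) {
        return "\\u" + Integer.toHexString(ch | 0x10000).substring(1);
    }

    // Hypothetical helper: copies printable ASCII as-is and replaces
    // every other UTF-16 code unit with its escape.
    static String escapeNonAscii(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (ch < 0x20 || ch > 0x7e) {
                sb.append(findUnicodeValue(ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The input is written with an escape so the demo does not
        // depend on the encoding of this source file.
        System.out.println(escapeNonAscii("resum\u00E8"));
    }
}
```

Note that Integer.toHexString produces lowercase digits, so this sketch prints resum\u00e8 rather than resum\u00E8; append `.toUpperCase()` to the hex string if uppercase is required.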

Vinod
1

Building on Tagir Valeev's idea of borrowing from java.util.Properties:

    package empty;

    public class CharsetEncode {

        public static void main(String[] args) {
            String s = "resumè";
            System.out.println(decompose(s));
        }

        public static String decompose(String s) {
            return saveConvert(s, true, true);
        }

        private static String saveConvert(String theString, boolean escapeSpace, boolean escapeUnicode) {
            int len = theString.length();
            int bufLen = len * 2;
            if (bufLen < 0) {
                bufLen = Integer.MAX_VALUE;
            }
            StringBuffer outBuffer = new StringBuffer(bufLen);

            for (int x = 0; x < len; x++) {
                char aChar = theString.charAt(x);
                // Handle common case first, selecting largest block that
                // avoids the specials below
                if ((aChar > 61) && (aChar < 127)) {
                    if (aChar == '\\') {
                        outBuffer.append('\\');
                        outBuffer.append('\\');
                        continue;
                    }
                    outBuffer.append(aChar);
                    continue;
                }
                switch (aChar) {
                case ' ':
                    if (x == 0 || escapeSpace)
                        outBuffer.append('\\');
                    outBuffer.append(' ');
                    break;
                case '\t':
                    outBuffer.append('\\');
                    outBuffer.append('t');
                    break;
                case '\n':
                    outBuffer.append('\\');
                    outBuffer.append('n');
                    break;
                case '\r':
                    outBuffer.append('\\');
                    outBuffer.append('r');
                    break;
                case '\f':
                    outBuffer.append('\\');
                    outBuffer.append('f');
                    break;
                case '=': // Fall through
                case ':': // Fall through
                case '#': // Fall through
                case '!':
                    outBuffer.append('\\');
                    outBuffer.append(aChar);
                    break;
                default:
                    if (((aChar < 0x0020) || (aChar > 0x007e)) & escapeUnicode) {
                        outBuffer.append('\\');
                        outBuffer.append('u');
                        outBuffer.append(toHex((aChar >> 12) & 0xF));
                        outBuffer.append(toHex((aChar >> 8) & 0xF));
                        outBuffer.append(toHex((aChar >> 4) & 0xF));
                        outBuffer.append(toHex(aChar & 0xF));
                    } else {
                        outBuffer.append(aChar);
                    }
                }
            }
            return outBuffer.toString();
        }

        private static char toHex(int nibble) {
            return hexDigit[(nibble & 0xF)];
        }

        /** A table of hex digits */
        private static final char[] hexDigit = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
    }
Dakshinamurthy Karra
  • This program prints resum\uFFFD. That's not what I needed here. I required resum\u00E8. http://www.fileformat.info/info/unicode/char/e8/index.htm – Swapnonil Mukherjee Jun 23 '15 at 11:13
  • Check the source encoding of the file in the IDE you are using. – Dakshinamurthy Karra Jun 23 '15 at 11:15
  • Works Now. I had to change the character encoding settings in IntelliJ. It was picking up the platform default Windows CP 1252. After I changed it to UTF-8, the code now works. So for this to work, I will have to make sure that our prod machines have UTF-8 as the default setting. – Swapnonil Mukherjee Jun 23 '15 at 11:31
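
On the \uFFFD output discussed in these comments: as far as I understand it, the literal "è" is decoded by javac at compile time using the source encoding, so the mismatch is between the file and the compiler rather than something in the conversion code. One way to take the IDE setting out of the picture while testing is to write the character as a Unicode escape in the source. The sketch below assumes that approach; `SourceEncodingSketch` is a hypothetical class name and its `escapeUnicode` is a shortened stand-in for the methods in the answers, not a copy of them.

```java
class SourceEncodingSketch {

    // Shortened stand-in for the escapeUnicode methods in the answers.
    static String escapeUnicode(String str) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c < 0x20 || c > 0x7e) {
                // The cast matters: %X accepts an int but rejects a char.
                sb.append(String.format("\\u%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-grave written as an escape, so this file is pure ASCII and
        // compiles to the same string under any source encoding.
        String data = "resum\u00E8";
        System.out.println((int) data.charAt(5));
        System.out.println(escapeUnicode(data));
    }
}
```

If the literal has to stay in the source as-is, passing `-encoding UTF-8` to javac (or the equivalent build-tool or IDE setting) should have the same effect. Strings that arrive at runtime from files or sockets are a separate matter and need the correct charset at the point where they are read.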
0

You can use the code below to convert è to \u00E8:

class A {

    public static void main(String[] args) {
        String data = "è";
        String a = Converter.Uni2JavaLiteral(data);
        System.out.println("a=" + a);
    }

}

class Converter {

    private static char hexdigit(int c) {
        String charset = "0123456789ABCDEF";
        return charset.charAt(c & 0x0F);
    }

    private static String buildLiteral(char c) {
        String literal = hexdigit(c >>> 12) + "" + hexdigit(c >>> 8) + "" + hexdigit(c >>> 4) + "" + hexdigit(c);
        return literal;
    }

    public static String Uni2JavaLiteral(String input) {

        String literals = "";

        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == 10) {
                literals += "\\n";
            } else if (input.charAt(i) == 13) {
                literals += "\\r";
            } else if (input.charAt(i) == 92) {
                literals += "\\\\";
            } else if (input.charAt(i) == ' ') {
                literals += " ";
            } else if (input.charAt(i) < 32 || input.charAt(i) > 126) {
                literals += "\\u" + buildLiteral(input.charAt(i));
            } else {
                literals += input.charAt(i);
            }
        }
        return literals;
    }

}

Rafiq
  • The code public static void main(String[] args) { String data = "è"; System.out.println(Converter.Uni2JavaLiteral(data)); } Outputs \uFFFD and not \u00e8. – Swapnonil Mukherjee Jun 23 '15 at 10:52
  • I checked the code and found no problem; the output is a=\u00E8 – Rafiq Jun 23 '15 at 10:59
  • Your code prints a=\uFFFD. But what I need is a=\u00e8. – Swapnonil Mukherjee Jun 23 '15 at 11:04
  • I checked the code in TextPad and the NetBeans IDE and got the output a=\u00E8 – Rafiq Jun 23 '15 at 11:25
  • Works Now. I had to change the character encoding settings in IntelliJ. It was picking up the platform default Windows CP 1252. After I changed it to UTF-8, the code now works. So for this to work, I will have to make sure that our prod machines have UTF-8 as the default setting. – Swapnonil Mukherjee Jun 23 '15 at 11:30
0

Here is an alternative for Java 8:

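  import java.util.stream.Collectors; // at the top of the enclosing file

  // Returns printable ASCII unchanged; anything else becomes a
  // backslash-u escape with four uppercase hex digits.
  public static String encode(int ch)
  {
    return (ch >= 32 && ch < 127)
        ? Character.toString((char) ch)
        : String.format("\\u%04X", ch);
  }

  // String.chars() streams UTF-16 code units, so each unit is escaped
  // separately.
  public static String encode(String s)
  {
    return s.chars().mapToObj(ch -> encode(ch)).collect(Collectors.joining());
  }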
clstrfsck
-1

Try using String.replace() to convert one char into another. See also the replaceAll() method.
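
This answer gives no code, so the following is only a guess at what a regex-based version might look like. String.replaceAll() takes a fixed replacement template, so computing a different escape for each match needs the underlying Matcher. `RegexEscapeSketch` and `escapeNonAscii` are hypothetical names, and the pattern (anything outside 0x20 to 0x7E) is an assumption carried over from the other answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEscapeSketch {

    // Matches any single code point outside printable ASCII.
    private static final Pattern NON_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E]");

    static String escapeNonAscii(String input) {
        Matcher m = NON_PRINTABLE_ASCII.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // A match can span two chars (a surrogate pair), so every
            // UTF-16 code unit in the match gets its own escape.
            StringBuilder escape = new StringBuilder();
            for (char c : m.group().toCharArray()) {
                escape.append(String.format("\\u%04X", (int) c));
            }
            // quoteReplacement keeps the backslash from being treated
            // as replacement-template syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(escape.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("resum\u00E8"));
    }
}
```

For a one-off conversion the plain loops in the other answers are probably simpler; the regex form mainly helps if the set of characters to escape is likely to change.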

carlos gil