I want to convert only the special characters in a string to their UTF-8 equivalent. For example, the String Abcds23#$_ss should be converted to Abcds23353695ss.

This is how I did the conversion above: the UTF-8 value of # is 23 in hexadecimal and 35 in decimal; for $ it is 24 in hexadecimal and 36 in decimal; for _ it is 5f in hexadecimal and 95 in decimal.

I know there is the String.replaceAll(String regex, String replacement) method, but I want to replace each special character with its own UTF-8 equivalent.

How do I do this in Java?

splash
Manas Saxena
  • why did "#$" become "353695" ? – niceman Jul 06 '16 at 11:01
  • This is how the conversion happened: the UTF-8 value of # is 23 in hexadecimal and 35 in decimal; for $ it is 24 and 36; for _ it is 5f and 95. Sorry, I edited the question: the string is Abcds23#$_ss and not Abcds23#$ss – Manas Saxena Jul 06 '16 at 11:05
  • Never put more information into comments, update your question instead. – GhostCat Jul 06 '16 at 11:16
  • Doable but you will not be able to convert the string back. – Joop Eggen Jul 06 '16 at 11:28

2 Answers

I don't know how you define "special characters", but this function should give you an idea:

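public static String convert(String str) 
{
    StringBuilder buf = new StringBuilder();
    // Step through the string one code point at a time, so that a
    // surrogate pair is handled as a single character.
    for (int index = 0; index < str.length(); )
    {
        int cp = str.codePointAt(index);
        if (Character.isLetterOrDigit(cp))
            buf.appendCodePoint(cp);   // keep letters and digits as they are
        else
            buf.append(cp);            // append the decimal code point value
        index += Character.charCount(cp);
    }
    return buf.toString();
}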

@Test
public void test()
{
    Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
splash
  • Yes, any character other than an alphanumeric character is a special character in my case. How is your program converting the special characters to their UTF-8 equivalent, as I do not see UTF-8 mentioned anywhere? Or is UTF-8 the default encoding used in Java? – Manas Saxena Jul 06 '16 at 11:37
  • @user2713255 I think the secret lies in `Character.codePointAt`, yes I assume UTF-8 is the default – niceman Jul 06 '16 at 11:39
  • @user2713255 `Character.codePointAt` returns the Unicode code point at the given index. – splash Jul 06 '16 at 11:41
  • You can get the codepoint simply by casting the `char` to `int`. http://stackoverflow.com/a/2006544/2568885 – binoternary Jul 06 '16 at 11:43
  • Checking that the character is alphanumeric can be made more readable: `Character.isLetterOrDigit(ch)` – binoternary Jul 06 '16 at 11:49
  • @binoternary I adopted your `isLetterOrDigit` suggestion, but I think `codePointAt` is more readable than simply casting. ;-) – splash Jul 06 '16 at 12:09
  • @binoternary Actually, casting char to int, gets you the same thing: one UTF-16 code unit, one or two of which encode a Unicode codepoint. – Tom Blodget Jul 06 '16 at 17:26
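
The difference between casting a `char` and calling `codePointAt`, discussed in the comments above, only shows up for characters that need two UTF-16 code units. A minimal sketch, assuming a made-up class name `CodePointVsCast` and helper names that are not part of the answer:

```java
class CodePointVsCast {
    // Value of the single UTF-16 code unit at the given index.
    static int byCast(String s, int index) {
        return (int) s.charAt(index);
    }

    // Full Unicode code point that starts at the given index.
    static int byCodePoint(String s, int index) {
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        String ascii = "#";
        // U+1F600 lies outside the Basic Multilingual Plane, so it is
        // stored as a surrogate pair (two chars).
        String emoji = new String(Character.toChars(0x1F600));

        System.out.println(byCast(ascii, 0));       // 35
        System.out.println(byCodePoint(ascii, 0));  // 35
        System.out.println(emoji.length());         // 2
        System.out.println(byCast(emoji, 0));       // 55357 (high surrogate 0xD83D)
        System.out.println(byCodePoint(emoji, 0));  // 128512 (0x1F600)
    }
}
```

For `#`, `$` and `_` from the question both approaches give the same numbers; they only diverge for surrogate pairs.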
The following requires Java 8 or later. It keeps every Unicode code point (symbol) that is a pure ASCII (< 128) letter or digit, and replaces any other code point with its numeric value written as a decimal string.

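import java.util.stream.IntStream;

static String convert(String str) {
    int[] cps = str.codePoints()
            .flatMap(cp ->
                // Keep ASCII letters and digits unchanged.
                Character.isLetterOrDigit(cp) && cp < 128
                ? IntStream.of(cp)
                // Replace anything else with the digits of its decimal code point.
                : String.valueOf(cp).codePoints())
            .toArray();
    return new String(cps, 0, cps.length);
}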

String.codePoints() yields an IntStream, flatMap merges the IntStream produced for each code point into a single flattened stream, and toArray collects the result in an array, from which a new String is constructed. Because it works on whole code points rather than on chars, this is entirely Unicode safe.
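
To make the individual steps visible, here is a small sketch that runs them separately. The class name `FlatMapSteps` and its helper methods are invented for illustration only:

```java
import java.util.Arrays;

class FlatMapSteps {
    // Code points of a string as an int array.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    // Build a String from an array of code points.
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // 'a' is code point 97, '#' is code point 35.
        System.out.println(Arrays.toString(codePointsOf("a#")));               // [97, 35]
        // The decimal text "35" consists of the digits '3' (51) and '5' (53).
        System.out.println(Arrays.toString(codePointsOf(String.valueOf(35)))); // [51, 53]
        // Flattened result for "a#": 97 stays, 35 is replaced by 51, 53.
        System.out.println(fromCodePoints(new int[] {97, 51, 53}));            // a35
    }
}
```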

The conversion cannot be reversed unless delimiters are added.
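
A sketch of why that is, and of one possible way around it. The delimiter format `%<number>;` and the names `DelimitedConvert`, `convertDelimited` and `revert` are assumptions made up for this example, not something the question asks for; `revert` also assumes well-formed input:

```java
class DelimitedConvert {
    static boolean isAsciiAlnum(int cp) {
        return cp < 128 && Character.isLetterOrDigit(cp);
    }

    // Same rule as above: ASCII letters and digits stay, everything
    // else becomes its decimal code point.
    static String convert(String str) {
        StringBuilder buf = new StringBuilder();
        str.codePoints().forEach(cp -> {
            if (isAsciiAlnum(cp)) buf.appendCodePoint(cp);
            else buf.append(cp);
        });
        return buf.toString();
    }

    // Hypothetical variant: wrap each number as %<number>; so that it
    // can be told apart from ordinary digits.
    static String convertDelimited(String str) {
        StringBuilder buf = new StringBuilder();
        str.codePoints().forEach(cp -> {
            if (isAsciiAlnum(cp)) buf.appendCodePoint(cp);
            else buf.append('%').append(cp).append(';');
        });
        return buf.toString();
    }

    // Inverse of convertDelimited (no error handling for malformed input).
    static String revert(String encoded) {
        StringBuilder buf = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char ch = encoded.charAt(i);
            if (ch == '%') {
                int end = encoded.indexOf(';', i);
                buf.appendCodePoint(Integer.parseInt(encoded.substring(i + 1, end)));
                i = end + 1;
            } else {
                buf.append(ch);
                i++;
            }
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Two different inputs collapse to the same output.
        System.out.println(convert("#5"));   // 355
        System.out.println(convert("355"));  // 355
        // With delimiters the original can be recovered.
        System.out.println(convertDelimited("#5"));        // %35;5
        System.out.println(revert(convertDelimited("#5"))); // #5
    }
}
```

Since `%` and `;` are not alphanumeric, they are themselves written as `%37;` and `%59;`, so the delimiters never appear unescaped in the output.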


On Unicode:

Unicode numbers its symbols, assigning each a code point from 0 up to U+10FFFF, which fits in 3 bytes.

To encode (format) code points as bytes there are UTF-8 (multi-byte, 1 to 4 bytes per code point), UTF-16LE and UTF-16BE (sequences of 2-byte units, one or two per code point) and UTF-32 (code points more or less as-is, in 4 bytes). Java string constants in a .class file are stored in a modified UTF-8. A String is composed of chars, which are UTF-16 code units, and a String can give code points as shown above. So Java by design uses Unicode for text.
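
The distinction between a code point and its encoded bytes can be sketched as follows. The class `EncodingDemo` and its helpers are made-up names for this illustration:

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Render bytes as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String utf16beHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        String hash = "#";        // U+0023
        String uUmlaut = "\u00fc"; // ü, U+00FC

        // For ASCII the code point and the single UTF-8 byte coincide.
        System.out.println(hash.codePointAt(0));     // 35
        System.out.println(utf8Hex(hash));           // 23

        // Outside ASCII they differ: one code point, two UTF-8 bytes.
        System.out.println(uUmlaut.codePointAt(0));  // 252
        System.out.println(utf8Hex(uUmlaut));        // c3bc
        System.out.println(utf16beHex(uUmlaut));     // 00fc
    }
}
```

This is why the numbers in the question (35, 36, 95) are really Unicode code points; they match the UTF-8 bytes only because `#`, `$` and `_` are ASCII.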

Joop Eggen
  • Why is the `cp < 128` condition required and what does it do? In my case any character other than letters and digits is considered a special character – Manas Saxena Jul 06 '16 at 12:05
  • There are letters like `ü`, Greek ones, Arabic digits, and so on. Pure ASCII goes up to 127. Using the class Character one could also formulate it differently, by checking that the script/block is ASCII, but that is a bit more verbose. – Joop Eggen Jul 06 '16 at 12:12
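
One way the "more verbose" block-based check mentioned in the last comment might look, next to the `cp < 128` version. This is only a sketch; the class and method names are invented here, and using `Character.UnicodeBlock.BASIC_LATIN` is an assumption about what the comment had in mind:

```java
class AsciiAlnumCheck {
    // Range-based check, as used in the answer.
    static boolean viaRange(int cp) {
        return cp < 128 && Character.isLetterOrDigit(cp);
    }

    // Block-based check: BASIC_LATIN covers U+0000 to U+007F.
    static boolean viaBlock(int cp) {
        return Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.BASIC_LATIN
                && Character.isLetterOrDigit(cp);
    }

    public static void main(String[] args) {
        int uUmlaut = 0x00FC;     // ü
        int arabicThree = 0x0663; // Arabic-Indic digit three

        // Without the ASCII restriction these count as letter or digit.
        System.out.println(Character.isLetterOrDigit(uUmlaut));     // true
        System.out.println(Character.isLetterOrDigit(arabicThree)); // true

        // With the restriction they are treated as special characters.
        System.out.println(viaRange(uUmlaut));     // false
        System.out.println(viaBlock(arabicThree)); // false
        System.out.println(viaRange('a'));         // true
        System.out.println(viaBlock('7'));         // true
    }
}
```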