Replace special characters in a string with their UTF-8 encoded character java?

Question

I want to convert only the special characters in a string to their UTF-8 numeric values. For example, given the String Abcds23#$_ss, the result should be Abcds23353695ss.

This is how I did the conversion above: the UTF-8 value of # is 23 in hexadecimal and 35 in decimal; of $ it is 24 in hexadecimal and 36 in decimal; of _ it is 5f in hexadecimal and 95 in decimal.

I know we have the String.replaceAll(String regex, String replacement) method, but I want to replace each special character with its own UTF-8 value.

How do I do this in Java?

Following is how the conversion happened: the UTF-8 value of # is 23 in hexadecimal and 35 in decimal; of $ it is 24 in hexadecimal and 36 in decimal; of _ it is 5f in hexadecimal and 95 in decimal. Sorry, I edited the question: the string is Abcds23#$_ss and not Abcds23#$ss. — Manas Saxena, Jul 06 '16 at 11:05
Never put more information into comments, update your question instead. — GhostCat, Jul 06 '16 at 11:16

splash · Answer 1 · 2016-07-06T12:05:23.183

I don't know how you define "special characters", but this function should give you an idea:

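public static String convert(String str)
{
    StringBuilder buf = new StringBuilder();
    // Advance by whole code points so a surrogate pair is treated as one character.
    for (int index = 0; index < str.length(); )
    {
        int ch = str.codePointAt(index);
        if (Character.isLetterOrDigit(ch))
            buf.appendCodePoint(ch);   // keep letters and digits unchanged
        else
            buf.append(ch);            // append the code point's decimal value
        index += Character.charCount(ch);
    }
    return buf.toString();
}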

@Test
public void test()
{
    Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
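On the UTF-8 part of the question: for the ASCII characters in the example, the Unicode code point and the single UTF-8 byte are the same number, so no encoder is needed to get 35, 36 and 95. Outside ASCII the two differ. The sketch below prints both for a few characters; the class name CodePointVsUtf8 and the helper utf8Decimal are illustrative names, not part of the answer above.

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsUtf8 {
    // Hypothetical helper: the UTF-8 bytes of one code point,
    // written as unsigned decimal values separated by spaces.
    static String utf8Decimal(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // bytes are signed in Java; mask to get 0..255
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // '#', '$' and '_' are ASCII; U+00E9 (e with acute accent) is not.
        for (String s : new String[] {"#", "$", "_", "\u00e9"}) {
            int cp = s.codePointAt(0);
            System.out.println("code point=" + cp + "  UTF-8 bytes=" + utf8Decimal(cp));
        }
    }
}
```

For U+00E9 the code point is 233 while the UTF-8 encoding is the two bytes 195 169, so "code point" and "UTF-8 value" only coincide below 128.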

edited Jul 06 '16 at 12:05

answered Jul 06 '16 at 11:28

splash

13,037
1
44
67

Yes, any character other than an alphanumeric character is a special character in my case. How is your program converting the special character to its UTF-8 equivalent, as I do not see UTF-8 mentioned anywhere? Or is UTF-8 the default encoding used in Java? – Manas Saxena Jul 06 '16 at 11:37
@user2713255 I think the secret lies in `Character.codePointAt`, yes I assume UTF-8 is the default – niceman Jul 06 '16 at 11:39
@user2713255 `Character.codePointAt` returns the Unicode code point at the given index. – splash Jul 06 '16 at 11:41
You can get the codepoint simply by casting the `char` to `int`. http://stackoverflow.com/a/2006544/2568885 – binoternary Jul 06 '16 at 11:43
Checking that the character is alphanumeric can be made more readable: `Character.isLetterOrDigit(ch)` – binoternary Jul 06 '16 at 11:49

@binoternary I adopted your `isLetterOrDigit` suggestion, but I think `codePointAt` is more readable than simply casting. ;-) – splash Jul 06 '16 at 12:09
@binoternary Actually, casting char to int, gets you the same thing: one UTF-16 code unit, one or two of which encode a Unicode codepoint. – Tom Blodget Jul 06 '16 at 17:26

Joop Eggen · Answer 2 · 2016-07-06T12:07:29.350

The following uses Java 8 or above. It keeps a Unicode code point (symbol) as-is if it is a letter or digit in pure ASCII (< 128); otherwise it outputs the numerical value of the code point as a string.

import java.util.stream.IntStream;

static String convert(String str) {
    int[] cps = str.codePoints()
            .flatMap(cp ->
                Character.isLetterOrDigit(cp) && cp < 128
                    ? IntStream.of(cp)                    // keep ASCII letters and digits
                    : String.valueOf(cp).codePoints())    // otherwise emit the decimal digits
            .toArray();
    return new String(cps, 0, cps.length);
}

String.codePoints() yields an IntStream, flatMap merges the per-code-point IntStreams into a single flattened stream, and toArray collects it into an array. So we can construct a new String from those code points. Entirely Unicode safe.

The conversion cannot be reversed without delimiters.
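To illustrate the point about delimiters: without them, "#5" and "355" both convert to "355", so the original cannot be recovered. One possible reversible variant, sketched below, wraps each decimal value in underscores. The format, the class name ReversibleEscape and the methods encode and decode are assumptions made for this example and are not part of the answer. It works because the underscore is itself a special character and is always escaped (as _95_), so an underscore in the output can only be a delimiter.

```java
public class ReversibleEscape {
    // Hypothetical format: each non-alphanumeric code point becomes _<decimal>_.
    static String encode(String str) {
        StringBuilder out = new StringBuilder();
        str.codePoints().forEach(cp -> {
            if (Character.isLetterOrDigit(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append('_').append(cp).append('_');
            }
        });
        return out.toString();
    }

    // Inverse of encode. Assumes well-formed input produced by encode;
    // malformed input will throw a runtime exception.
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '_') {
                int end = encoded.indexOf('_', i + 1);
                out.appendCodePoint(Integer.parseInt(encoded.substring(i + 1, end)));
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("Abcds23#$_ss");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```

The output is longer than in the undelimited scheme, which is the price of being able to decode it.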

On Unicode:

Unicode assigns each symbol a number, called a code point, from 0 up to U+10FFFF, which fits in 3 bytes.

To encode (format) code points as bytes there exist UTF-8 (multi-byte, 1 to 4 bytes per code point), UTF-16LE and UTF-16BE (sequences of 2-byte units, one or two units per code point) and UTF-32 (code points as-is, more or less). Java string constants in a .class file are stored in a modified form of UTF-8. A String is composed of UTF-16 chars (code units), and a String can give code points as above. So Java by design uses Unicode for text.
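The differences between these encodings can be observed with String.getBytes. The following is a small sketch; the class name EncodingSizes and its helper methods are illustrative only. It compares the number of chars, code points, and encoded bytes for a few sample characters.

```java
import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16BE (no byte order mark is written).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+0041 (A), U+00FC (u with diaeresis), U+20AC (euro sign),
        // U+1F600 (an emoji, written here as a surrogate pair)
        String[] samples = {"A", "\u00fc", "\u20ac", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + " chars=" + s.length()
                    + " codePoints=" + s.codePointCount(0, s.length())
                    + " utf8Bytes=" + utf8Length(s)
                    + " utf16Bytes=" + utf16Length(s));
        }
    }
}
```

The last sample is one code point but two chars, which is why code that must be Unicode safe iterates over code points rather than chars.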

Why is the `cp < 128` condition required and what does it do? In my case any character other than letters and digits is considered a special character — Manas Saxena, Jul 06 '16 at 12:05
There are letters like `ü`, Greek ones, Arabic digits, and so on. Pure ASCII goes up to 127. Using the class Character one could also formulate it differently, checking that the script/block is ASCII, but that is a bit more verbose. — Joop Eggen, Jul 06 '16 at 12:12
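A short sketch of what the `cp < 128` test changes; the class name AsciiAlnumCheck and the method isAsciiAlphanumeric are illustrative names, not from the answer. Character.isLetterOrDigit alone accepts non-ASCII letters and digits, while the combined test accepts only a-z, A-Z and 0-9.

```java
public class AsciiAlnumCheck {
    // True only for ASCII letters and digits (hypothetical helper name).
    static boolean isAsciiAlphanumeric(int cp) {
        return cp < 128 && Character.isLetterOrDigit(cp);
    }

    public static void main(String[] args) {
        // 'a', '#', U+00FC (u with diaeresis), U+0661 (Arabic-Indic digit one)
        int[] samples = {'a', '#', 0x00FC, 0x0661};
        for (int cp : samples) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase()
                    + " isLetterOrDigit=" + Character.isLetterOrDigit(cp)
                    + " isAsciiAlphanumeric=" + isAsciiAlphanumeric(cp));
        }
    }
}
```

With the extra condition, `ü` is treated as special and comes out as its code point value 252; without it, `ü` would pass through unchanged.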
