
From the link below I can see some unknown characters in UCS-2. What are they? Why are they unknown? Does that mean we cannot decode them?

http://www.columbia.edu/kermit/ucs2.html

Basically, a user is sending a UCS-2 (DCS 8) message to our router, but when I decode it I get junk characters. For example, D83E DD13 is printed as ? or some junk. How can I print and view such characters with their proper value in a text file?

Thanks & regards, Ashwini

Alex K.
ashu
  • If you google for "0xD83E 0xDD13" you can see it's an emoji in UTF-16; if that's what's being sent then it's not representable in UCS-2 – Alex K. Apr 10 '19 at 12:00
  • Any idea what the DCS value for UTF-16 is? – ashu Apr 10 '19 at 12:24
  • As in GSM? There isn't one. You can convert UCS-2 to UTF-16. – Alex K. Apr 10 '19 at 12:33
  • In GSM, we are receiving dcs=8 (which is UCS-2) along with this encoded value D83E DD13. We are writing this to a text file using a Unicode converter, but it writes junk. Any idea how to write those emojis to a text file? Is it possible? – ashu Apr 11 '19 at 06:26
  • Your input is UCS-2-BE (Big Endian); make sure you're converting your input to UTF-16 from that, as opposed to from UCS-2-LE. Make sure you're viewing the converted text file in an editor with the text encoding set to UTF-16 (again, there are BE/LE variants), and ensure the font you're using has a character for that emoji. – Alex K. Apr 11 '19 at 10:12
  • How can I identify the characters that fall in the Unicode range 0x0000 to 0xFFFF using Java? – ashu Apr 12 '19 at 09:56
  • @ashu a Java `char` is 16-bit, so ALL `char` values fall within the 0x0000-0xFFFF range. What you really need to ask is whether a given `char` represents a UTF-16 surrogate for a Unicode codepoint that is outside of the UCS-2 range (see [What is a "surrogate pair" in Java](https://stackoverflow.com/questions/5903008/)). You can use `Character.is(High|Low)Surrogate()` to test whether a `char` is a UTF-16 surrogate. Codepoints that don't use surrogates are the same in both UCS-2 and UTF-16; codepoints that require surrogates do not exist in UCS-2. – Remy Lebeau Apr 12 '19 at 21:45
  • @RemyLebeau I will have a string containing both ordinary 16-bit chars and UTF-16 surrogate chars, so I need to strip all the high/low surrogate chars. I am using the function below. Is it correct? Please advise. `str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");` – ashu Apr 15 '19 at 06:22
  • @RemyLebeau or is this feasible StringBuffer finalStr = new StringBuffer(); char[] chars = str.toCharArray(); for(int i=0;i – ashu Apr 15 '19 at 06:33
  • @ashu you don't need to specify the two surrogate ranges separately in the regular expression; they are sequential, so a single range will suffice: `str.replaceAll("[\uD800-\uDFFF]", "");` – Remy Lebeau Apr 15 '19 at 15:23
  • @ashu if you use the `StringBuffer` approach, you can simplify the loop by using `isSurrogate()`, which tests for both high and low. And you don't need the `char[]` at all: `StringBuffer finalStr = new StringBuffer(); for(int i = 0; i < str.length(); i++){ char ch = str.charAt(i); if (!Character.isSurrogate(ch)) { finalStr.append(ch); }}` – Remy Lebeau Apr 15 '19 at 15:26
  • @RemyLebeau I finally found a solution: `str.replaceAll("[^\u0000-\uffff]", "");`. Basically, if a character doesn't fall in the Basic Multilingual Plane, I replace it with an empty string. The suggested `str.replaceAll( "([\\ud800-\\udfff])", "");` is not working. Input given: `String str = "heéaà";` output: `?heéaà`. It's not replacing. – ashu Apr 16 '19 at 06:29
  • Does this answer your question? [What is a "surrogate pair" in Java?](https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java) – JosefZ Nov 08 '20 at 21:59
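
Following Alex K.'s comment about treating the payload as big-endian and converting it to UTF-16, here is a minimal Java sketch of that step. It assumes the DCS 8 payload arrives as a hex string such as `D83E DD13`. The class name `SmsPayloadDecoder`, the helper names, and the output file `sms.txt` are made up for illustration and are not from the thread. The sketch writes UTF-8 rather than UTF-16 only because many editors open UTF-8 by default; that choice is an assumption, not something the comments prescribe.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SmsPayloadDecoder {

    // Turns a hex string such as "D83E DD13" (whitespace allowed) into raw bytes.
    static byte[] hexToBytes(String hex) {
        String clean = hex.replaceAll("\\s+", "");
        if (clean.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[clean.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Decodes the payload as big-endian UTF-16 so that surrogate pairs
    // (for example D83E DD13) are kept together as one code point.
    static String decodePayload(String hex) {
        return new String(hexToBytes(hex), StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) throws IOException {
        String text = decodePayload("D83E DD13");

        // Hypothetical output file; the bytes are written as UTF-8.
        Path out = Paths.get("sms.txt");
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));

        // Print the decoded code point in hex for a quick sanity check.
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

The resulting file still has to be opened with the matching encoding (UTF-8 here) and in a font that has a glyph for the emoji, otherwise it will keep showing as `?` or a box.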

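
The detection and stripping approaches discussed in the comments can be put side by side in a short Java sketch. The class name `SupplementaryFilter` and its method names are invented for illustration. One likely explanation for ashu's report that the surrogate-range pattern did not replace anything is that `java.util.regex` matches by code point, so a valid surrogate pair is seen as a single supplementary character rather than two surrogate chars. The negated-BMP pattern and the `Character.isSurrogate` loop both avoid that issue.

```java
public class SupplementaryFilter {

    // True if the string contains any UTF-16 surrogate code unit,
    // i.e. something that plain UCS-2 cannot represent.
    static boolean hasSurrogates(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Removes every code point above U+FFFF (outside the Basic
    // Multilingual Plane). Lone, unpaired surrogates are left in place.
    static String stripNonBmp(String s) {
        return s.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

    // Removes every surrogate code unit, paired or lone, one char at a time.
    static String stripSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (!Character.isSurrogate(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "he" + the D83E DD13 pair from the question + "llo"
        String input = "he\uD83E\uDD13llo";
        System.out.println(hasSurrogates(input));   // true
        System.out.println(stripNonBmp(input));     // hello
        System.out.println(stripSurrogates(input)); // hello
    }
}
```

The two stripping methods differ only for malformed input: the loop also drops unpaired surrogates, while the regex keeps them.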