
I have a string "\ufffd\ufffd hello\n".

I have code like this:

    fun main() {
      val bs = "\ufffd\ufffd hello\n"
      println(bs) // �� hello
    }

and I want to see "\ufffd\ufffd hello" printed literally. How can I escape \u for every hex value?

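UPD:

val s = """\uffcd"""
// In a raw string each regex backslash is written once: \\ matches one literal backslash.
val unicodeRegex = """(?<!\\)((?:\\\\)*)(\\u)([A-Fa-f\d]{4})""".toRegex()
// Prefix every unescaped \uXXXX with one more backslash.
return s.replace(unicodeRegex, """$1\\\\u$3""")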
vl4deee11

2 Answers

(I'm interpreting the question as asking how to clearly display a string that contains non-printable characters.  The Kotlin compiler converts sequences of a \u followed by 4 hex digits in string literals into single characters, so the question is effectively asking how to convert them back again.)

Unfortunately, there's no built-in way of doing this.  It's fairly easy to write one, but it's a bit subjective, as there's no single definition of what's ‘printable’.

Here's an extension function that probably does roughly what you want:

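// Escapes characters in the listed Unicode categories as \uXXXX; leaves the rest as-is.
fun String.printable() = map {
    when (Character.getType(it).toByte()) {
        Character.CONTROL, Character.FORMAT, Character.PRIVATE_USE,
        Character.SURROGATE, Character.UNASSIGNED, Character.OTHER_SYMBOL
            -> "\\u%04x".format(it.code) // Char.code (Kotlin 1.5+) replaces the deprecated toInt()
        else -> it.toString()
    }
}.joinToString("")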

println("\ufffd\ufffd hello\n".printable()) // prints ‘\ufffd\ufffd hello\u000a’

The sample string in the question is a bad example, because \uFFFD is the replacement character — a black diamond with a question mark, usually shown in place of any non-displayable characters.  So the replacement character itself is displayable!

The code above treats it as non-displayable by including the Character.OTHER_SYMBOL type in the list of escaped types, but that also escapes many other symbols that display perfectly well.  So you'll probably want to remove it, leaving just the other 5 types.  (I got those from this answer.)
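
As a quick check of the categories involved, here is a minimal sketch. The helpers `isOtherSymbol` and `isControl` are hypothetical names for illustration, not part of the code above. On the JDK, the replacement character and an ordinary symbol such as the copyright sign should land in the same OTHER_SYMBOL category, while a newline is CONTROL:

```kotlin
// Hypothetical helper: true when the JDK classifies ch as OTHER_SYMBOL (Unicode category So).
fun isOtherSymbol(ch: Char): Boolean =
    Character.getType(ch) == Character.OTHER_SYMBOL.toInt()

// Hypothetical helper: true when the JDK classifies ch as CONTROL (Unicode category Cc).
fun isControl(ch: Char): Boolean =
    Character.getType(ch) == Character.CONTROL.toInt()
```

Both `isOtherSymbol('\uFFFD')` and `isOtherSymbol('\u00A9')` are true, which is why keeping OTHER_SYMBOL in the list escapes far more than the replacement character.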

Because the trailing newline is non-displayable, that gets converted to a hex code too.  You could extend the code to handle the escape codes \t, \b, \n, \r and maybe \\ too if needed.  (You could also make it more efficient… this was done for brevity!)
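
One possible way to extend it with those short escape codes is sketched below. The name `printableWithEscapes` is made up for this example, and I've assumed you took the advice above and dropped Character.OTHER_SYMBOL, so treat it as a starting point rather than a finished implementation:

```kotlin
// Hypothetical extension: short escapes for common characters, \uXXXX for other non-printables.
fun String.printableWithEscapes(): String = buildString {
    for (ch in this@printableWithEscapes) {
        when (ch) {
            '\t' -> append("\\t")
            '\b' -> append("\\b")
            '\n' -> append("\\n")
            '\r' -> append("\\r")
            '\\' -> append("\\\\") // escape the backslash so the output stays unambiguous
            else -> when (Character.getType(ch).toByte()) {
                // Assumption: OTHER_SYMBOL is left out, so U+FFFD is printed as-is.
                Character.CONTROL, Character.FORMAT, Character.PRIVATE_USE,
                Character.SURROGATE, Character.UNASSIGNED
                    -> append("\\u%04x".format(ch.code))
                else -> append(ch)
            }
        }
    }
}
```

Using buildString also avoids creating a temporary String per character, which addresses the efficiency remark.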

gidds
  • Thank you, this is generally what I wanted. I very rarely write Kotlin/Java (usually Go or C), and in such cases I usually work with bytes directly. This is a good solution. – vl4deee11 Sep 13 '21 at 14:19

Simply escape the \ in your strings by adding another backslash in front of it:

val bs = "\\ufffd\\ufffd hello\n"

You can also use raw strings with """ so you don't have to escape the backslashes (which is useful for regex):

val bs = """\ufffd\ufffd hello\n"""
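
To make the difference concrete, here is a small sketch comparing the three kinds of literal (the variable names are just for illustration, and the trailing \n is left out to keep the comparison simple):

```kotlin
// Escaped literal: each \\ in the source becomes one backslash in the string.
val escaped = "\\ufffd\\ufffd hello"
// Raw literal: no escape processing, so \ufffd stays as six separate characters.
val raw = """\ufffd\ufffd hello"""
// Ordinary literal: the compiler turns each \ufffd into a single U+FFFD character.
val decoded = "\ufffd\ufffd hello"
```

Here `escaped == raw` holds and both are 18 characters long, while `decoded` is only 8 characters long.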

Note that in that case the \n would also NOT be counted as an LF character, and will be literally printed as the 2 characters "\n". You can add literal line breaks in your raw string if you want an actual line feed, though:

val bs = """\ufffd\ufffd hello
"""
Joffrey
  • Yes, thank you for the answer. I know about escaping with \, but I would like to do it automatically: I have this type of data in JSON, so I would like to convert it to this form automatically. – vl4deee11 Sep 13 '21 at 11:04
  • @vl4deee11 the `\u` notation is specific to string literals. Once the strings in the JSON response are converted to string instances, they contain actual characters. So how do you decide which characters exactly you would like to convert to a unicode escape code? If you want to keep the unicode escapes from the initial JSON, then you'll probably have to deal with custom JSON deserializers for strings, but I guess you could also decide on a set of characters that you want to replace with their unicode escape. – Joffrey Sep 13 '21 at 11:39
  • I need the character not to be encoded but to remain in the form of \uxxxx – vl4deee11 Sep 13 '21 at 11:50
  • As I said, once you have a `String` instance there is no such thing as characters encoded with `\uXXXX`. All characters are stored in the string with some encoding that is irrelevant to the programmer. So you can't know at this point which characters used to be part of a literal encoded in `\uXXXX` and which were written as-is. If you want to know this, you need to act at JSON deserialization time. If you don't want to mess up JSON deserialization, you can choose a character range that you will encode this way. – Joffrey Sep 13 '21 at 12:27
  • You probably misunderstood me slightly. Strings of this type come to me in JSON from Kafka. These strings are not escaped in the JSON, so they can't be decoded properly, but most likely I will need to look for such characters at the deserialization stage. Thanks for your answer. – vl4deee11 Sep 13 '21 at 12:48
  • What do you mean by "they are not escaped in JSON"? From what I understood you get JSON like `{ "text": "\ufffd\ufffd hello\n" }`, and you want to print that text in this form somewhere else. What I'm saying is that once the decoding of this JSON is over and you get an object like `Something(val text: String)`, there is no way to know from that `text` string whether a particular character used to be encoded as `\uXXXX` in the JSON or was written directly there. It's just a character now. So you have to hook up some custom deserializer to prevent decoding `\uXXXX` codes if you really want that. – Joffrey Sep 13 '21 at 14:03
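
Following up on the other option mentioned in the comments (choosing a character range to encode), here is a minimal sketch that escapes everything outside printable ASCII. The name `escapeNonAscii` and the choice of range are assumptions for illustration, not something from the thread:

```kotlin
// Hypothetical extension: keep printable ASCII (space through ~), escape everything else as \uXXXX.
fun String.escapeNonAscii(): String = buildString {
    for (ch in this@escapeNonAscii) {
        if (ch in ' '..'~') append(ch) else append("\\u%04x".format(ch.code))
    }
}
```

With this, `"\ufffd\ufffd hello\n".escapeNonAscii()` yields the text `\ufffd\ufffd hello\u000a`. It cannot tell which characters were originally written as escapes in the JSON; it simply re-encodes every character outside the chosen range.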