Only Unicode escapes
If you only need to unescape sequences of the form \u0000,
then a single regex replace is enough:
def unescapeUnicode(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar match {
      // Backslash and dollar are special in replacement strings, so escape them.
      case '\\' => """\\"""
      case '$'  => """\$"""
      case c    => c.toString
    })
And the result is
scala> unescapeUnicode("b\\u00f4lovar \\u30B7")
res1: String = bôlovar シ
We have to handle the characters $ and \ separately, because java.util.regex.Matcher.appendReplacement treats them as special in the replacement string. Without that handling, the replacement fails:
def wrongUnescape(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)
scala> wrongUnescape("\\u00" + Integer.toString('$', 16))
java.lang.IllegalArgumentException: Illegal group reference: group index is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:819)
... 46 elided
scala> wrongUnescape("\\u00" + Integer.toString('\\', 16))
java.lang.IllegalArgumentException: character to be escaped is missing
at java.util.regex.Matcher.appendReplacement(Matcher.java:809)
... 46 elided
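As an alternative to escaping $ and \ by hand, the replacement text can be quoted with scala.util.matching.Regex.quoteReplacement, which wraps java.util.regex.Matcher.quoteReplacement. The following is only a sketch of that idea; the name unescapeUnicodeQuoted is made up for this illustration and is not part of any library:

```scala
import scala.util.matching.Regex

// Hypothetical variant of unescapeUnicode: the same pattern, but the decoded
// character is passed through quoteReplacement so that '$' and a backslash
// are inserted literally instead of being read as group references/escapes.
def unescapeUnicodeQuoted(str: String): String =
  """\\u+([0-9a-fA-F]{4})""".r.replaceAllIn(str, m =>
    Regex.quoteReplacement(Integer.parseInt(m.group(1), 16).toChar.toString))
```

With this variant, the two inputs that made wrongUnescape throw should simply yield "$" and a single backslash.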
All escape characters
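Unicode character escapes are a bit special: they are not part of string literals, but part of the program source itself. At least in the Scala version used for the REPL session below, the compiler replaces Unicode escapes with the corresponding characters in a separate phase, before ordinary escapes are interpreted: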
scala> Integer.toString('a', 16)
res2: String = 61
scala> val \u0061 = "foo"
a: String = foo
scala> // first \u005c is replaced with a backslash, and then \t is replaced with a tab.
scala> "\u005ct"
res3: String = " "
The Scala library provides the function StringContext.treatEscapes, which supports all the normal escapes from the language specification.
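To make the behaviour of StringContext.treatEscapes concrete, here is a small sketch of it used on its own. The value names raw, cooked and invalid are made up for this illustration. As far as I know, newer versions of the standard library mark treatEscapes as deprecated in favour of StringContext.processEscapes, so expect a deprecation warning there:

```scala
import scala.util.Try

// In source, "\\t" denotes the two characters backslash and 't'.
val raw = "col1\\tcol2\\n"

// treatEscapes turns those two-character sequences into a real tab and newline.
val cooked = StringContext.treatEscapes(raw)

// An unknown escape such as backslash-q is rejected with an exception
// (StringContext.InvalidEscapeException), captured here as a Failure.
val invalid = Try(StringContext.treatEscapes("\\q"))
```

Here raw is 12 characters long, while cooked is 10 characters long and contains an actual tab and newline.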
To support both Unicode escapes and all the normal Scala escapes, you can apply the two steps in sequence:
def unescape(str: String): String =
  StringContext.treatEscapes(unescapeUnicode(str))
scala> unescape("\\u0061\\n\\u0062")
res4: String =
a
b
scala> unescape("\\u005ct")
res5: String = " "