
I want to replace every occurrence of an escape sequence of the form "\uXXXX", where "XXXX" is a hexadecimal number representing a Unicode character, with the corresponding character.

I tried the following Scala code:

def unscape(s : String) : String = {
 val rex = """\\u([0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z][0-9a-zA-Z])""".r
 rex.replaceAllIn(s, m => {
     hex2str(m.group(1))
   })
}

def hex2str(s:String): String = {
  Integer.parseInt(s,16).toChar.toString  
}

If I try, for example:

unscape("Hi\\u0024, \\u0024")

it gives the following exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: 1

In this other question, it seems that there could be a bug in Java's treatment of Unicode characters. Is that the problem?

Labra

2 Answers


Just to tweak the accepted answer:

  import scala.util.matching.Regex

  def unscape3(s: String): String = {
    val rex = """\\u(\p{XDigit}{4})""".r
    rex.replaceAllIn(s, m => Regex quoteReplacement hex2str(m group 1))
  }

  Console println unscape3("""Hi\u0024, \u0024""")

Note that the character class now matches only hexadecimal digits, and that with quoteReplacement you don't need to know which characters require escaping.

(Maybe more efficient than scanning the replacement text multiple times.)
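For reference, here is a self-contained sketch of the same idea that can be compiled on its own. It assumes the hex2str helper from the question; the names UnescapeSketch and unescape are only illustrative.

```scala
import scala.util.matching.Regex

object UnescapeSketch {
  // Convert four hexadecimal digits to the one-character string they encode.
  def hex2str(hex: String): String =
    Integer.parseInt(hex, 16).toChar.toString

  // A backslash, the letter u, then exactly four hex digits (captured as group 1).
  private val escape: Regex = """\\u(\p{XDigit}{4})""".r

  // quoteReplacement makes the decoded character safe to use as a replacement,
  // even when it happens to be a dollar sign or a backslash.
  def unescape(s: String): String =
    escape.replaceAllIn(s, m => Regex.quoteReplacement(hex2str(m.group(1))))

  def main(args: Array[String]): Unit = {
    println(unescape("Hi\\u0024, \\u0024"))  // Hi$, $
    println(unescape("\\u005c and \\u0041")) // a single backslash, then " and A"
  }
}
```

The inputs are written with doubled backslashes so that the compiler does not decode the escapes itself before the function sees them.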

som-snytt

Try the following:

def unscape(s: String): String = {
    val rex = """\\u([0-9a-fA-F]{4})""".r
    rex.replaceAllIn(s, m => {
        hex2str(m.group(1))
            .replaceAllLiterally("\\", "\\\\")
            .replaceAllLiterally("$", "\\$")
    })
}

According to the documentation for Matcher.appendReplacement, which replaceAllIn uses internally:

Note that backslashes (\) and dollar signs ($) in the replacement string may cause the results to be different than if it were being treated as a literal replacement string. Dollar signs may be treated as references to captured subsequences as described above, and backslashes are used to escape literal characters in the replacement string.
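The behaviour described in that passage can be seen in a small sketch using only java.util.regex and String.replaceAll. The object name ReplacementDemo and the sample strings are made up for illustration.

```scala
import java.util.regex.Matcher

object ReplacementDemo {
  // In a replacement string, "$1" and "$2" refer to captured groups,
  // so a dollar sign is not taken literally.
  val swapped: String = "key=value".replaceAll("(\\w+)=(\\w+)", "$2=$1")

  // quoteReplacement puts a backslash in front of each dollar sign and
  // backslash, so the replacement is treated as literal text.
  val quotedDollar: String = Matcher.quoteReplacement("$")

  // With quoting, "$5" is inserted as is instead of being read as group 5.
  val literal: String = "price: X".replaceAll("X", Matcher.quoteReplacement("$5"))

  def main(args: Array[String]): Unit = {
    println(swapped)      // value=key
    println(quotedDollar) // a backslash followed by a dollar sign
    println(literal)      // price: $5
  }
}
```

In the question, hex2str returns a bare "$" for 0024, which the matcher tries to read as the start of a group reference; that is where the exception comes from.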

falsetru