
I have a multi-line JSON file with records that contain special characters encoded as hexadecimals. Here is an example of a single JSON record:

{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}

This record is supposed to be {"value":"ıarines Bintıç Ramuçlar"}. That is, each '"' character is replaced with the corresponding hexadecimal escape \x22, and other special Unicode characters are replaced with one or two hexadecimal escapes (for instance, \xC3\xA7 encodes ç).

I need to convert strings like this into a regular Unicode String in Scala, so that when printed it produces {"value":"ıarines Bintıç Ramuçlar"} without the hexadecimal escapes.

In Python 2 I can easily decode these records with a line of code:

>>> a = "{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"
>>> a.decode("utf-8")
u'{"value":"\u0131arines Bint\u0131\xe7 Ramu\xe7lar"}'
>>> print a.decode("utf-8")
{"value":"ıarines Bintıç Ramuçlar"}

But in Scala I can't find a way to decode it. I unsuccessfully tried to convert it like this:

scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(new String(a.getBytes(), "UTF-8"))
{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}
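A likely explanation for why this attempt changes nothing, as a minimal sketch: a triple-quoted Scala literal keeps \x22 as four ordinary characters (backslash, x, 2, 2), so getBytes followed by new String simply round-trips the same text. The object name RawLiteralCheck and the shortened sample string are illustrative assumptions, not part of the original attempt.

```scala
object RawLiteralCheck {
  // Triple-quoted literals do not interpret \x escapes, so the
  // backslash and the hex digits are stored as ordinary characters.
  val a: String = """{\x22value\x22}"""

  // Encoding to bytes and decoding again with the same charset
  // gives back exactly the same string.
  val roundTripped: String = new String(a.getBytes("UTF-8"), "UTF-8")

  def main(args: Array[String]): Unit = {
    println(a.length)            // counts \, x, 2, 2 as separate characters
    println(roundTripped == a)
  }
}
```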

I also tried URLDecoder, as suggested in a solution to a similar problem (but with URLs):

scala> val a = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""
scala> print(java.net.URLDecoder.decode(a.replace("\\x", "%"), "UTF-8"))
{"value":"ıarines Bintıç Ramuçlar"}

It produced the desired result for this example, but it seems unsafe for generic text fields, since it is designed to work with URLs and requires replacing every \x with % in the string.
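To make that concern concrete, here is a small sketch of two ways the URLDecoder trick can go wrong on ordinary text. The object name UrlDecoderPitfalls, the helper viaUrlDecoder, and the sample values are hypothetical and only serve as illustration.

```scala
import java.net.URLDecoder
import scala.util.Try

object UrlDecoderPitfalls {
  // Same trick as in the attempt above: turn \x into % and let URLDecoder decode.
  def viaUrlDecoder(s: String): String =
    URLDecoder.decode(s.replace("\\x", "%"), "UTF-8")

  // A literal '+' in the data is silently turned into a space.
  val plusExample: String =
    viaUrlDecoder("""{\x22value\x22:\x22a+b\x22}""")

  // A literal '%' that is not followed by two hex digits is rejected
  // with an IllegalArgumentException, captured here as a Failure.
  val percentExample: Try[String] =
    Try(viaUrlDecoder("""{\x22value\x22:\x22100% sure\x22}"""))
}
```

With these inputs, plusExample loses the plus sign and percentExample is a Failure, so any field that may contain + or % would need extra handling.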

Does Scala have some better way to deal with this issue?

I am new to Scala and would be thankful for any help.

UPDATE: I have made a custom solution with javax.xml.bind.DatatypeConverter.parseHexBinary. It works for now, but it seems cumbersome and not at all elegant. I think there should be a simpler way to do this.

Here is the code:

import javax.xml.bind.DatatypeConverter
import scala.annotation.tailrec
import scala.util.matching.Regex

def decodeHexChars(string: String): String = {
  val regexHex: Regex = """\A\\[xX]([0-9a-fA-F]{1,2})(.*)""".r
  def purgeBuffer(buffer: String, acc: List[Char]): List[Char] = {
    if (buffer.isEmpty) acc
    else new String(DatatypeConverter.parseHexBinary(buffer), "UTF-8").reverse.toList ::: acc
  }
  @tailrec
  def traverse(s: String, acc: List[Char], buffer: String): String = s match {
    case "" =>
      val accUpdated = purgeBuffer(buffer, acc)
      accUpdated.foldRight("")((str, b) => b + str)
    case regexHex(chars, suffix) =>
      traverse(suffix, acc, buffer + chars)
    case _ =>
      val accUpdated = purgeBuffer(buffer, acc)
      traverse(s.tail, s.head :: accUpdated, "")
  }
  traverse(string, Nil, "")
}
Huko Jack

2 Answers

0

The problem is that this encoding is really specific to Python (I think). Something like this might work:

import scala.util.matching.Regex

val s = """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""

// Replace each \xNN escape with the text decoded from that one byte.
"""\\x([A-F0-9]{2})""".r.replaceAllIn(s, (x: Regex.Match) =>
  new String(BigInt(x.group(1), 16).toByteArray, "UTF-8")
)
Alvaro Carrasco
  • Thank you for the answer! It doesn't seem to work, though. The issue is that some characters are encoded with a single hexadecimal escape and others with a combination of two. I have posted an update to the original post, which is also based on a regular expression. – Huko Jack Jul 19 '17 at 21:14
  • @HukoJack one hex digit is invalid, there should always be exactly two. Otherwise, if you have `\xAA` how would you tell whether it is `\u013A` or just `ɒ` (`\u0252`)? The convention is that if you run into `\x` followed by anything other than two hex digits, you either error out or take it literally. – Dima Jul 20 '17 at 00:31
  • @Dima but in cases when you have something like `"çtext` which would be encoded as `\x22\xC3\xA7text` wouldn't this greedy approach first try to decode `\x22\xC3` and then `\xA7` and produce an error? – Huko Jack Jul 20 '17 at 07:46
0

Each \x?? encodes one byte: \x22 encodes " and \x5C encodes \. But in UTF-8 some characters are encoded using multiple bytes, so you need to transform \xC4\xB1 into the ı symbol, and so on.
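The byte-level view can be checked with a short sketch. The object name Utf8Bytes and its two helpers are hypothetical names introduced only for this illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object Utf8Bytes {
  // Render every UTF-8 byte of a string in the \xNN notation.
  def toHexEscapes(s: String): String =
    s.getBytes(UTF_8).map(b => "\\x%02X".format(b & 0xFF)).mkString

  // Decode a sequence of raw byte values back into text.
  def fromBytes(bytes: Int*): String =
    new String(bytes.map(_.toByte).toArray, UTF_8)

  def main(args: Array[String]): Unit = {
    println(toHexEscapes("\""))     // one byte
    println(toHexEscapes("ı"))      // two bytes
    println(fromBytes(0xC4, 0xB1))  // both bytes together give ı
  }
}
```

Decoding 0xC4 on its own does not produce ı, which is why a run of consecutive escapes has to be collected into one byte array before decoding.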

replaceAllIn is really nice, but it might eat your backslashes. So, if you don't use group references (like $1) in the replacement string, quoteReplacement is the recommended way to escape \ and $ symbols.
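A small sketch of what goes wrong without quoting, and how quoteReplacement avoids it. The object name ReplacementEscaping, the "X" marker pattern, and the sample replacement strings are made up for this illustration.

```scala
import scala.util.Try
import scala.util.matching.Regex

object ReplacementEscaping {
  private val marker: Regex = "X".r

  // Unquoted: the backslash in the replacement is treated as an escape
  // character and disappears from the result.
  val eaten: String = marker.replaceAllIn("aXb", _ => """C:\dir""")

  // Unquoted: a '$' that does not start a valid group reference is rejected
  // with an exception, captured here as a Failure.
  val dollar: Try[String] = Try(marker.replaceAllIn("aXb", _ => "$x"))

  // Quoted: both the backslash and the dollar sign are inserted literally.
  val kept: String =
    marker.replaceAllIn("aXb", _ => Regex.quoteReplacement("""C:\dir$x"""))
}
```

This matters for the decoding problem because \x5C and \x24 decode to exactly these two characters.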

/** "22" -> 34, "AA" -> -86  */
def hex2byte(hex: String) = Integer.parseInt(hex, 16).toByte

/** decode strings like \x22 or \xC4\xB1\xC3\xA7 to specified encoding   */
def decodeHexadecimals(str: String, encoding: String="UTF-8") = 
  new String(str.split("""\\x""").tail.map(hex2byte), encoding)

/** fix weird strings */
def replaceHexadecimals(str: String, encoding: String="UTF-8") = 
  """(\\x[\dA-F]{2})+""".r.replaceAllIn(str, m => 
    util.matching.Regex.quoteReplacement(
      decodeHexadecimals(m.group(0), encoding)))
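As a quick usage check, the functions above can be applied to the record from the question. The sketch below restates them inside a hypothetical object HexDecodeDemo, with explicit return types added, so that it runs on its own.

```scala
import scala.util.matching.Regex

object HexDecodeDemo {
  // "22" -> 34, "AA" -> -86
  def hex2byte(hex: String): Byte = Integer.parseInt(hex, 16).toByte

  // Decode a run such as \xC4\xB1\xC3\xA7 as one byte array.
  def decodeHexadecimals(str: String, encoding: String = "UTF-8"): String =
    new String(str.split("""\\x""").tail.map(hex2byte), encoding)

  // Replace every run of consecutive \xNN escapes with its decoded text.
  def replaceHexadecimals(str: String, encoding: String = "UTF-8"): String =
    """(\\x[\dA-F]{2})+""".r.replaceAllIn(str, m =>
      Regex.quoteReplacement(decodeHexadecimals(m.group(0), encoding)))

  val record: String =
    """{\x22value\x22:\x22\xC4\xB1arines Bint\xC4\xB1\xC3\xA7 Ramu\xC3\xA7lar\x22}"""

  def main(args: Array[String]): Unit =
    println(replaceHexadecimals(record)) // {"value":"ıarines Bintıç Ramuçlar"}
}
```

Because the pattern only lists A-F, escapes written with lowercase hex digits (for example \xc3\xa7) are left untouched; widening the character class to [\dA-Fa-f] would be one way to accept them.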

P.S. Does anyone know the difference between java.util.regex.Matcher.quoteReplacement and scala.util.matching.Regex.quoteReplacement?