1

I have a byte array (or more precisely a ByteString) of UTF8 strings, which are prefixed by their length as 2-bytes (msb, lsb). For example:

val z = akka.util.ByteString(0, 3, 'A', 'B', 'C', 0, 5, 
        'D', 'E', 'F', 'G', 'H',0,1,'I')

I would like to convert this to a list of strings, so it should similar to List("ABC", "DEFGH", "I").

Is there an elegant way to do this?

(EDIT) These strings are NOT null terminated, the 0 you are seeing in the array is just the MSB. If the strings were long enough, the MSB would be greater than zero.

Will I Am
  • 2,614
  • 3
  • 35
  • 61

3 Answers3

1

Edit: Updated based on clarification in comments that first 2 bytes define an int. So I converted it manually.

def convert(bs: List[Byte]) : List[String] = {
  bs match {
    case count_b1 :: count_b2 :: t =>
      val count =  ((count_b1 & 0xff) << 8) | (count_b2 & 0xff)
      val (chars, leftover) = t.splitAt(count)
      new String(chars.toArray, "UTF-8") :: convert(leftover)
    case _ => List()
  }
}

Call convert(z.toList)

Lionel Port
  • 3,492
  • 23
  • 26
  • Thank you. These strings are not null terminated, sorry for the confusion. – Will I Am Dec 17 '14 at 23:44
  • Would it be faster to calculate length via bitwise operations? In my code I do something like count = (buff(1) & 0xff) | ((buff(0) & 0xff) << 8) – Will I Am Dec 18 '14 at 04:59
  • Yes. I kind of preferred the ByteBuffer as to me its more readable what the intention is but I guess if you wrap it in a function like bytes2int it would have the same effect. – Lionel Port Dec 18 '14 at 05:10
  • Cool. I also built my own solution using a recursive tail function (as hinted by Enzyme's link below) and now I need to figure out a good way to benchmark and see which is best. Since i'm processing network packets, I am performance sensitive. – Will I Am Dec 18 '14 at 05:15
  • Just for reference, 10000 iterations of the allocate version took 180ms, and the bite-manipulation version took 61ms. Now I am going to try and fix some of the other versions to try them out – Will I Am Dec 18 '14 at 17:00
  • Just out of curiosity, I converted your approach to a tailrec call, and shaved off another 20 ms off of 1 million loop. Wheee. Thanks! I will need to test it to make sure this works ok for large strings I cannot post here, but it looks good for 90% of my use cases. – Will I Am Dec 18 '14 at 22:22
0

Here is my answer with foldLeft.

def convert(z : ByteString) = z.foldLeft((List() : List[String], ByteString(), 0, 0))((p, b : Byte) => {
  p._3 match {
    case 0 if p._2.nonEmpty => (p._2.utf8String :: p._1, ByteString(), -1, b.toInt)
    case 0 => (p._1, p._2, -1, b.toInt)
    case -1 => (p._1, p._2, (p._4 << 8) + b.toInt, 0)
    case _ => (p._1, p._2 :+ b, p._3 - 1, 0)
  }
})

It works like this:

scala> val bs = ByteString(0, 3, 'A', 'B', 'C', 0, 5,  'D', 'E', 'F', 'G', 'H',0,1,'I')
scala>   val k = convert(bs); (k._2.utf8String :: k._1).reverse
k: (List[String], akka.util.ByteString, Int, Int) = (List(DEFGH, ABC),ByteString(73),0,0)
res20: List[String] = List(ABC, DEFGH, I)
suztomo
  • 5,114
  • 2
  • 20
  • 21
  • Thank you. I will give this a try. Why do you think the strings are null terminated? The 0 in the array is just the MSB of the length. Unless I'm missing something. – Will I Am Dec 17 '14 at 22:44
  • Now I got it. The length part consists of 2 bytes! Then my solution won't work. I see others also assumed the zeros in array are separator. – suztomo Dec 17 '14 at 23:03
  • Sorry about that, I added a small clarification in the question for future reference. – Will I Am Dec 17 '14 at 23:21
  • Thanks. It seems like this version scales poorly. In my benchmarks, it's fairly equivalent to the version by Lionel below at small iterations up to about 10,000 (66ms vs 68ms). When I push it to 100,000 it becomes more than 2x slower (180ms vs 340ms). It kinda tapers off (at 1m, I get 1191ms vs 2943ms). – Will I Am Dec 18 '14 at 20:33
0

Consider multiSpan method as defined here which is a repeated application of span over a given list,

z.multiSpan(_ == 0).map( _.drop(2).map(_.toChar).mkString )

Here the spanning condition is whether an item equals 0, then we drop the first two prefixing bytes, and convert the remaining to a String.

Note On using multiSpan, recall to import annotation.tailrec .

Community
  • 1
  • 1
elm
  • 20,117
  • 14
  • 67
  • 113
  • Thank you. As discussed below with Suztomo in another answer, it doesn't account for the MSB/LSB prefix. These strings are not null terminated. I just happen to have the MSB as 0 so I don't have to post an array with a huge string. – Will I Am Dec 17 '14 at 23:36