
I am writing a small communication protocol with TCP sockets. I am able to read and write basic data types such as integers, but I have no idea how to read a UTF-8 encoded string from a slice of bytes.

The protocol client is written in Java and the server is Go.

From what I have read, Go runes are 32 bits long and UTF-8 characters are 1 to 4 bytes long, which makes it impossible to simply cast a byte slice to a string.

I'd like to know how I can read and write this UTF-8 stream.

Note that I know the length of the byte buffer by the time I read the string.

Mikhas

1 Answer


Some theory first:

  • A rune in Go represents a Unicode code point: a number assigned to a particular character in Unicode. It's an alias for int32.
  • UTF-8 is a Unicode encoding: a format for representing Unicode code points for storage and transmission. UTF-8 uses 1 to 4 bytes to encode a single code point.
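
To make the distinction concrete, here is a minimal sketch that encodes a few code points with the standard unicode/utf8 package and prints the resulting bytes. The helper name encodedLen is made up for this illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen (hypothetical helper) reports how many bytes UTF-8 needs
// to encode the single code point r.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// Four code points that need 1, 2, 3 and 4 bytes respectively.
	for _, r := range []rune{'a', '\u00e9', '\u20ac', '\U0001F600'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r) // writes the UTF-8 form of r into buf
		fmt.Printf("U+%04X -> % x (%d bytes, RuneLen=%d)\n", r, buf[:n], n, encodedLen(r))
	}
}
```

The rune is always a single 32-bit number in memory; only its encoded form varies in length.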

How this maps on Go data types:

  • Both []byte and string store a series of bytes (a byte in Go is an alias for uint8).

    The chief difference is that strings are immutable, so while you can

      b := make([]byte, 2)
      b[0] = byte('a')
      b[1] = byte('z')
    

    you can't

      var s string
      s[0] = byte('a') // does not compile: strings are immutable
    

    The latter fact is further underscored by the inability to set the string length explicitly (as in an imaginary s := make(string, 10)).

  • While strings in Go contain abstract bytes (you're free to store in them, say, characters encoded using Windows-1252), certain Go statements and type conversions interpret strings as being encoded in UTF-8, in particular:

    • A type conversion between string and []rune parses the string as a sequence of UTF-8-encoded code points and produces a slice of them. The reverse type conversion takes the Unicode code points from the slice of runes and produces a UTF-8-encoded string.
    • A range loop over a string loops through Unicode code points comprising the string, not just bytes.
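
A short sketch of both behaviours, using a string whose second character takes two bytes in UTF-8. The function name countBytesAndRunes is invented for the example.

```go
package main

import "fmt"

// countBytesAndRunes (hypothetical helper) returns the length of s in
// bytes and the number of code points it contains.
func countBytesAndRunes(s string) (nBytes, nRunes int) {
	nBytes = len(s) // len counts bytes, not characters
	for range s {   // range decodes UTF-8: one iteration per code point
		nRunes++
	}
	return nBytes, nRunes
}

func main() {
	s := "h\u00e9llo" // U+00E9 occupies two bytes in UTF-8

	fmt.Println(countBytesAndRunes(s)) // prints "6 5"

	// string -> []rune decodes UTF-8; []rune -> string encodes it again.
	rs := []rune(s)
	rs[1] = 'e'
	fmt.Println(string(rs)) // prints "hello"

	// The index yielded by range is a byte offset, so it skips from 1 to 3.
	for i, r := range s {
		fmt.Printf("%d:%c ", i, r)
	}
	fmt.Println()
}
```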

Go also supplies the type conversions between string and []byte and back. Now recall that strings are read-only, while slices of bytes are not. This means a construct like

b := make([]byte, 1000)
io.ReadFull(r, b) // error handling omitted for brevity
s := string(b)    // copies the bytes of b into a new string

always copies the data, whether you convert a slice to a string or back. This costs extra memory but is type-safe and preserves the immutability of strings.
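
The copy is easy to observe: change the slice after the conversion and the string stays as it was. A tiny sketch, where toStringCopy is just an illustrative wrapper around the conversion:

```go
package main

import "fmt"

// toStringCopy (hypothetical helper) converts b to a string.
// The conversion copies the bytes, so the result does not alias b.
func toStringCopy(b []byte) string {
	return string(b)
}

func main() {
	b := []byte("az")
	s := toStringCopy(b)

	b[0] = 'b' // modifies only the slice; s owns its own copy

	fmt.Println(s, string(b)) // prints "az bz"
}
```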

Now back to your task at hand.

If you work with reasonably small strings and are not under memory pressure, just convert the byte slices filled by io.ReadFull() (or whatever you read with) to strings. Be sure to reuse the slice you read into to ease the pressure on the garbage collector: there is no need to allocate a new slice for each read, since the conversion copies the data out of it into the string anyway.
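
As a rough sketch of that approach, assume (this framing is not from the question) that each string travels as a 4-byte big-endian length followed by that many UTF-8 bytes. The names writeString, readString, roundTrip and maxFrameLen are made up for the illustration, and a bytes.Buffer stands in for the TCP connection.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"unicode/utf8"
)

// maxFrameLen is an assumed sanity limit, not part of any real protocol.
const maxFrameLen = 1 << 20

// writeString sends a 4-byte big-endian length, then the bytes of s.
// Go strings are normally UTF-8 already, so no re-encoding is needed.
func writeString(w io.Writer, s string) error {
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(s)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := io.WriteString(w, s)
	return err
}

// readString reads one frame. buf is scratch space that the caller passes
// back in on the next call; it is reallocated only when it is too small.
func readString(r io.Reader, buf []byte) (string, []byte, error) {
	var hdr [4]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return "", buf, err
	}
	n := binary.BigEndian.Uint32(hdr[:])
	if n > maxFrameLen {
		return "", buf, errors.New("frame too large")
	}
	size := int(n)
	if cap(buf) < size {
		buf = make([]byte, size)
	}
	buf = buf[:size]
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", buf, err
	}
	if !utf8.Valid(buf) { // optional check: reject malformed input early
		return "", buf, errors.New("payload is not valid UTF-8")
	}
	return string(buf), buf, nil // string(buf) copies, so buf can be reused
}

// roundTrip writes s to an in-memory buffer and reads it back.
func roundTrip(s string) (string, error) {
	var conn bytes.Buffer // stands in for the TCP connection
	if err := writeString(&conn, s); err != nil {
		return "", err
	}
	got, _, err := readString(&conn, nil)
	return got, err
}

func main() {
	got, err := roundTrip("h\u00e9llo")
	fmt.Printf("%q %v\n", got, err)
}
```

On the Java side this framing would correspond to sending the length of s.getBytes(StandardCharsets.UTF_8) followed by those bytes. Note that DataOutputStream.writeUTF uses a 2-byte length and a modified UTF-8, so it would not match this sketch.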

Finally, if you absolutely must avoid copying the data (say, you're dealing with multi-megabyte strings and have tight memory requirements), you may try to play dirty tricks by unsafely working with memory; here is an example of how you might transplant the memory from a byte slice to a string. Note that should you resort to something like this, you must understand very well that it is free to break with any new release of Go, and it is not even guaranteed to work at all.
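
For illustration only, and not the example referred to above: Go 1.20 and later provide unsafe.String and unsafe.SliceData, which allow a sketch like the following. The function name bytesToStringNoCopy is invented here, and the code assumes the caller never modifies the slice after the call.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToStringNoCopy (hypothetical helper, requires Go 1.20+) returns a
// string that shares memory with b instead of copying it.
// The caller must never modify b afterwards: doing so would change a
// value the rest of the program assumes is immutable.
func bytesToStringNoCopy(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("no copy was made")
	s := bytesToStringNoCopy(b)
	fmt.Println(s, len(s))
	// From here on b must be treated as read-only.
}
```

This also means the read buffer cannot be reused for the next message while the string is still alive, which gives up the buffer reuse suggested earlier.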

maerics
kostix
  • Just to clarify: when I am "converting" a byte slice into a string, Go does not just "cast" the bytes but constructs a new string? – Mikhas Nov 25 '13 at 23:00
  • @Mikhas, that is correct: the data bytes are copied over. This is one of the very few places in Go where its developers made a pragmatic choice to allow this *hidden cost* in exchange for the simplicity of a type conversion. Think again: if Go just "cast" a byte slice's data to a string, the original slice would not somehow automatically cease to exist after the cast, and you would be able to change the contents of the string *guaranteed to be read-only* through the slice, which would share its data with the string. That would violate the semantics (and that's what the `implantSlice` hack does). – kostix Nov 26 '13 at 04:27