29

Let's convert string to []byte:

func toBytes(s string) []byte {
  return []byte(s) // What happens here?
}

How expensive is this cast operation? Is copying performed? As far as I can see in the Go specification ("Strings behave like slices of bytes but are immutable"), this should involve at least a copy, to be sure subsequent slice operations will not modify our string s. What happens with the reverse conversion? Does []byte <-> string conversion involve encoding/decoding, like utf8 <-> runes?

zzzz
demi

3 Answers

43

The []byte(s) is not a cast but a conversion. Some conversions are the same as a cast, like uint(myIntvar), which just reinterprets the bits in place. Unfortunately that's not the case for the string to byte slice conversion. Byte slices are mutable, strings (string values, to be precise) are not. As a result, the conversion has to make a copy of the string (memory allocation plus content transfer). So yes, it can be costly in some scenarios.

EDIT: No encoding transformation is performed. The string (source) bytes are copied to the slice (destination) bytes as they are.
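A minimal sketch of both points, with hypothetical helper names (mutateCopy and roundTrip are not from the answer): modifying the converted slice leaves the original string untouched, and arbitrary bytes, including invalid UTF-8, survive the round trip unchanged.

```go
package main

import "fmt"

// mutateCopy (hypothetical helper) converts s to a byte slice, overwrites
// the first byte of the slice, and returns the original string together
// with the modified bytes converted back to a string.
func mutateCopy(s string) (string, string) {
	b := []byte(s) // the bytes of s are copied into the slice
	if len(b) > 0 {
		b[0] = 'X' // changes only the copy, never s
	}
	return s, string(b)
}

// roundTrip (hypothetical helper) converts s to []byte and back.
func roundTrip(s string) string {
	return string([]byte(s))
}

func main() {
	orig, changed := mutateCopy("hello")
	fmt.Println(orig, changed) // hello Xello

	// "\xff\xfe" is not valid UTF-8; the conversion copies it as is.
	raw := "\xff\xfeabc"
	fmt.Println(len(raw), len([]byte(raw)), roundTrip(raw) == raw) // 5 5 true
}
```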

zzzz
  • 1
    It's not very clear how `string` is stored internally: as utf8 or as runes? Can you say something about this? – demi Jan 17 '13 at 07:06
  • 2
    From http://golang.org/ref/spec#String_types: "The elements of strings have type byte and may be accessed using the usual indexing operations.". So no UTF-8, no runes. String is a numbered sequence of bytes, _any_ bytes. OTOH, many stdlib functions work only with UTF-8 encoded strings. – zzzz Jan 17 '13 at 07:14
  • 2
    @demi The representation of the string is a sort of byte array. However, strings are clearly intended to contain utf-8 encoding: the `range` statement applied on strings decodes utf-8 and returns code points. [See here in the spec](http://golang.org/ref/spec#For_statements). – lbonn Jan 17 '13 at 07:46
  • @jnml I expressed myself poorly. I didn't mean that a string must contain utf-8 encoded code points but that this is implicitly encouraged as there is native support for it. – lbonn Jan 17 '13 at 09:49
  • 6
    The book *"Programming in Go"* made the claim: *"The `[]byte(string)` conversion is very fast (O(1)) since under the hood the `[]byte` can simply refer to the string’s underlying bytes with no copying required. The same is true of the reverse conversion, `string([]byte);` again the underlying bytes are not copied, so the conversion is O(1)."* So it seems this claim was incorrect. – the system Jan 17 '13 at 15:22
  • 4
    FWIW, there's an [open issue](http://code.google.com/p/go/issues/detail?id=2205) requesting a feature where if the resulting byte array is never modified after a `[]byte(s)` conversion, it would use the underlying array provided by the string instead of doing a copy. Seems like a useful optimization on the surface. – the system Jan 17 '13 at 19:07
13

The conversion copies the bytes, but it also allocates space for the []byte on the heap. In cases where you convert strings to []byte repeatedly, you might save memory management time by reusing the []byte and using the built-in copy function. (See http://golang.org/ref/spec#Appending_and_copying_slices and the special case about using a string as the source.)

In both cases, the conversion and the copy function, the copy itself is a straight byte copy which should run very quickly. I would expect the compiler to generate some kind of repeat move instruction that the CPU executes efficiently.
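As a rough sketch of the buffer-reuse idea, assuming a hypothetical helper named fill: it copies a string into an existing []byte with the built-in copy and allocates only when the buffer's capacity is too small.

```go
package main

import "fmt"

// fill (hypothetical helper) copies s into buf and returns the slice
// holding the bytes of s. A new backing array is allocated only when
// buf's capacity is smaller than len(s).
func fill(buf []byte, s string) []byte {
	if cap(buf) < len(s) {
		buf = make([]byte, len(s))
	}
	buf = buf[:len(s)]
	copy(buf, s) // copy accepts a string as the source for a []byte destination
	return buf
}

func main() {
	buf := make([]byte, 0, 64) // allocated once, reused below
	for _, s := range []string{"alpha", "beta", "gamma"} {
		buf = fill(buf, s)
		fmt.Printf("%s (cap %d)\n", buf, cap(buf))
	}
}
```

An equivalent one-liner is `buf = append(buf[:0], s...)`, which also reuses the backing array when it is large enough.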

The reverse conversion, making a string out of a byte slice, definitely involves allocating the string on the heap. The immutability property forces this. Sometimes you can optimize by doing as much work as possible with []byte and then creating a string at the end. The bytes.Buffer type is often useful.
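One way this can look, as a sketch only (joinParts is a hypothetical helper, not something from the answer): the pieces are accumulated in a bytes.Buffer, and a string is created once at the end.

```go
package main

import (
	"bytes"
	"fmt"
)

// joinParts (hypothetical helper) writes all parts into a bytes.Buffer,
// separated by sep, and creates a string only once at the end.
func joinParts(parts []string, sep byte) string {
	var buf bytes.Buffer
	for i, p := range parts {
		if i > 0 {
			buf.WriteByte(sep)
		}
		buf.WriteString(p)
	}
	return buf.String() // the single []byte -> string step
}

func main() {
	fmt.Println(joinParts([]string{"a", "b", "c"}, ',')) // a,b,c
}
```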

Chasing the red herring now, encoding and UTF-8 are not issues. Strings and []byte can both hold arbitrary data. The copy does not look at the data, it just copies it. Choose words carefully when saying things like strings are intended to contain UTF-8 or that this is encouraged. It is more accurate to simply note that some language features, such as the range clause of a for statement, interpret strings as UTF-8. Just learn what interprets strings as UTF-8 and what doesn't. Have non-UTF-8 in a string and need to range over it byte-wise? No problem, just don't use the range clause.

s := "string"
for i := 0; i < len(s); i++ {
    b := s[i]
    // work with b
}

This is idiomatic Go. It is not discouraged and it violates no intention. It simply iterates over the string byte-wise, which is sometimes just what you want to do.
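To make the difference concrete, here is a small sketch (the two counting helpers are hypothetical names) that runs both loop styles over a string containing a multi-byte character and one invalid byte. The indexed loop visits every byte, while the range clause decodes UTF-8 and yields U+FFFD for the invalid byte.

```go
package main

import "fmt"

// countByIndex (hypothetical helper) counts iterations of a byte-wise loop.
func countByIndex(s string) int {
	n := 0
	for i := 0; i < len(s); i++ {
		n++
	}
	return n
}

// countByRange (hypothetical helper) counts iterations of the range clause,
// which advances one decoded code point at a time.
func countByRange(s string) int {
	n := 0
	for range s {
		n++
	}
	return n
}

func main() {
	// 'h' (1 byte), 'é' (2 bytes in UTF-8), and one byte that is not valid UTF-8.
	s := "h\u00e9\xff"

	fmt.Println(countByIndex(s), countByRange(s)) // 4 3

	for i, r := range s {
		fmt.Printf("%d:%U ", i, r) // 0:U+0068 1:U+00E9 3:U+FFFD
	}
	fmt.Println()
}
```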

Sonia
1

To complement the answers above: the latest Go language specification states the special rule for conversions involving numeric types and string types as follows:

Specific rules apply to (non-constant) conversions between numeric types or to and from a string type. These conversions may change the representation of x and incur a run-time cost. All other conversions only change the type but not the representation of x.

cmhzc