39

I've been reading up on Google Protocol Buffers recently, which allows for a variety of scalar value types to be used in messages.

According to the documentation, there are three variable-length integer primitives: int32, uint32, and sint32. The docs note that int32 is "Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead." But if a field never holds negative numbers, I assume uint32 would be a better type to use than int32 anyway (due to the extra bit and the decreased CPU cost of processing negative numbers).

So when would int32 be a good scalar to use? Is the documentation implying that it's most efficient only when you rarely get negative numbers? Or is it always preferable to use sint32 and uint32, depending on the contents of the field?

(The same questions apply to the 64-bit versions of these scalars as well: int64, uint64, and sint64; but I left them out of the problem description for readability's sake.)

Tim Sylvester
Dan Lew

2 Answers

51

I'm not familiar with Google Protocol Buffers, but my interpretation of the documentation is:

  • use uint32 if the value cannot be negative
  • use sint32 if the value is pretty much as likely to be negative as not (for some fuzzy definition of "as likely to be")
  • use int32 if the value could be negative, but that's much less likely than the value being positive (for example, if the application sometimes uses -1 to indicate an error or 'unknown' value and this is a relatively uncommon situation)

Here's what the docs have to say about the encodings (http://code.google.com/apis/protocolbuffers/docs/encoding.html#types):

there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on...
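To make that mapping concrete, here's a minimal sketch in C++ (my own illustration, not code from the docs; the function names are hypothetical):

```cpp
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// ZigZag-encode a signed 32-bit value: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
// (n >> 31) relies on arithmetic right shift, which mainstream compilers
// perform for signed types (and C++20 requires).
uint32_t zigzag_encode32(int32_t n) {
    return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}

// Invert the mapping: the low bit holds the sign, the rest the magnitude.
int32_t zigzag_decode32(uint32_t z) {
    return static_cast<int32_t>((z >> 1) ^ (0 - (z & 1)));
}

int main() {
    for (int32_t n : {0, -1, 1, -2, 2, INT32_MAX, INT32_MIN}) {
        const uint32_t z = zigzag_encode32(n);
        std::printf("%11d -> %10u -> %11d\n", n, z, zigzag_decode32(z));
    }
    return 0;
}
```

Both directions are just a couple of shifts and XORs, so the choice between these types is really about encoded size rather than CPU cost.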

So it looks like even if your use of negative numbers is rare, as long as the magnitude of the numbers (including non-negative numbers) you're passing in the protocol is on the smaller side, you might be better off using sint32. If you're unsure, profiling would be in order.

Michael Burr

9

There is very little good reason to ever use int* rather than sint*. The existence of these extra types is most likely for historical, backwards-compatibility reasons, which Protocol Buffers tries to maintain even across its own protocol versions.

My best guess is that in the earliest version they dumbly encoded negative integers in 2's complement representation, which requires the maximally sized varint encoding of 9 octets (not counting the extra type octet). Then they were stuck with that encoding so as not to break old code and serializations that already used it. So, they needed to add a new encoding type, sint*, to get a better variably sized encoding for negative numbers while not breaking existing code. How the designers didn't realize this issue from the get-go is utterly beyond me.

The 64b varint encoding (without type specification, which requires 1 more octet) can encode an unsigned integer value in the following number of octets (a size-calculator sketch follows the list):

  • [0, 2^7): one octet
  • [2^7, 2^14): two octets
  • [2^14, 2^21): three octets
  • [2^21, 2^28): four octets
  • [2^28, 2^35): five octets
  • [2^35, 2^42): six octets
  • [2^42, 2^49): seven octets
  • [2^49, 2^56): eight octets
  • [2^56, 2^64): nine octets
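As a concreteness check, here is my own sketch of a size calculator for this nine-octets-max variant, where the ninth octet carries a full eight bits (varint_size is a hypothetical helper, not a protobuf API; note the official wire format quoted in the first answer instead tops out at ten seven-bit octets for 64-bit values):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: octets needed to encode v under the nine-octet
// variant described above. Each octet carries 7 payload bits, except the
// ninth, which carries the final 8.
int varint_size(uint64_t v) {
    int octets = 1;
    while (v >= 0x80 && octets < 9) {  // more than 7 bits left, not at the cap
        v >>= 7;
        ++octets;
    }
    return octets;
}

int main() {
    // Spot-check the boundaries from the list above.
    std::printf("%d\n", varint_size(0));                 // 1
    std::printf("%d\n", varint_size((1ULL << 7) - 1));   // 1
    std::printf("%d\n", varint_size(1ULL << 7));         // 2
    std::printf("%d\n", varint_size((1ULL << 56) - 1));  // 8
    std::printf("%d\n", varint_size(1ULL << 56));        // 9
    std::printf("%d\n", varint_size(~0ULL));             // 9
    return 0;
}
```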

If you want to similarly encode small-magnitude negative integers compactly, then you need to "use up" one bit to indicate the sign. You can do this with an explicit sign bit (at some reserved position) and a magnitude representation. Or you can use ZigZag encoding, which effectively does the same thing by left shifting the magnitude by 1 bit and subtracting 1 for negative numbers (so the least significant bit indicates the sign: evens are non-negative, odds are negative).
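As a quick check that those two formulations really agree, here is my own sketch comparing the usual shift/XOR ZigZag against the literal "double the magnitude, subtract 1 if negative" rule:

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Shift/XOR formulation of ZigZag for 64-bit values.
uint64_t zigzag64(int64_t n) {
    return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}

// Literal formulation from the paragraph above: double the magnitude, then
// subtract 1 for negative numbers. The magnitude of INT64_MIN overflows
// int64_t, so it is computed in uint64_t.
uint64_t sign_in_lsb(int64_t n) {
    if (n >= 0) {
        return static_cast<uint64_t>(n) << 1;  // even: non-negative
    }
    const uint64_t mag = static_cast<uint64_t>(-(n + 1)) + 1;
    return (mag << 1) - 1;                     // odd: negative
}

int main() {
    for (int64_t n : {INT64_C(0), INT64_C(-1), INT64_C(1), INT64_C(-2),
                      INT64_MAX, INT64_MIN}) {
        assert(zigzag64(n) == sign_in_lsb(n));
        std::printf("%" PRId64 " -> %" PRIu64 "\n", n, zigzag64(n));
    }
    return 0;
}
```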

Either way, the cut-over points at which positive integers require more space now come a factor of 2 earlier:

  • [0, 2^6): one octet
  • [2^6, 2^13): two octets
  • [2^13, 2^20): three octets
  • [2^20, 2^27): four octets
  • [2^27, 2^34): five octets
  • [2^34, 2^41): six octets
  • [2^41, 2^48): seven octets
  • [2^48, 2^55): eight octets
  • [2^55, 2^63): nine octets

To make the case for int* over sint*, negative numbers would have to be extremely rare (but possible), and/or the most common positive values you expect to encode would have to fall right around one of the cut-over points where sint* needs a larger encoding than int* (e.g., values in [2^6, 2^7) take one octet as int* but two as sint*, doubling the encoding size).
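To see those trade-offs in actual octet counts, here is a small comparison (again my own sketch, reusing the hypothetical helpers from the sketches above):

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Hypothetical helpers repeated from the earlier sketches.
int varint_size(uint64_t v) {
    int octets = 1;
    while (v >= 0x80 && octets < 9) { v >>= 7; ++octets; }
    return octets;
}

uint64_t zigzag64(int64_t n) {
    return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}

int main() {
    // int64-style: the two's-complement bit pattern goes into the varint
    // directly. sint64-style: ZigZag first, then varint.
    std::printf("%20s %6s %7s\n", "value", "int64", "sint64");
    for (int64_t n : {INT64_C(-1), INT64_C(63), INT64_C(64),
                      INT64_C(100), INT64_C(1000000)}) {
        std::printf("%20" PRId64 " %6d %7d\n", n,
                    varint_size(static_cast<uint64_t>(n)),
                    varint_size(zigzag64(n)));
    }
    return 0;
}
```

Under this scheme, -1 costs nine octets as int64 but only one as sint64, while 64 (= 2^6) is the first positive value that gets one octet larger under sint64.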

Basically, if you are going to have numbers where some may be negative, then by default use sint* rather than int*. int* will very rarely be superior, and it usually isn't even worth the extra thought required to judge whether it would be, IMHO.

jschultz410

  • And to understand your `[ )` notation: https://stackoverflow.com/a/4396303/4561887 – Gabriel Staples Aug 01 '19 at 23:46
  • Valid explanation, but note that (int64) negative values will take 10 bytes not 9 as 7-bits per byte runs out of steam just before 2^63, not 2^64, so setting the 64th (sign) bit requires another byte. See proposed edit. – SensorSmith Oct 06 '22 at 21:10
  • @SensorSmith Thanks for your suggestion, but that is incorrect. There are two ways to encode 64b varints, but both encodings 'extract' the one extra bit needed from the last octet as there is no point in reserving a bit for indicating yet another octet in the last octet as there can't be one. So, you effectively get 7 bits per octet, except you get 8 bits in the last one. 7b * 8 + 8b = 64b encoded in 9 octets. Which is nice and tidy and fully utilizes the space without weird things like "spare capacity" left over or multiple non-canonical representations of the same value, etc. – jschultz410 Oct 24 '22 at 16:07
  • Good to know; and very logical. Thanks for the clear and professional explanation. – SensorSmith Oct 24 '22 at 19:27
  • Still, ints are used in protobuf "well-known" types, like [duration](https://protobuf.dev/reference/protobuf/google.protobuf/#duration) or [timestamp](https://protobuf.dev/reference/protobuf/google.protobuf/#timestamp). – Alex Che May 15 '23 at 21:26