4

I am reading the "strlen" source code from the glibc, and the trick developers found to speed it up is to read n bytes where n is the size of a long word, instead of reading 1 byte at each iteration.

I will assume that a long word has 4 bytes.

The tricky part is that every "chunk" of 4 bytes the function reads can contain a null byte, so at each iteration, the function has to check if there was a null byte in the chunk. They do it like

if (((longword - lomagic) & ~longword & himagic) != 0) { /* null byte found */ }

where longword is the chunk of data and himagic and lowmagic are magical values defined as:

himagic = 0x80808080L;
lomagic = 0x01010101L;

Here is the comment for thoses values

/* Bits 31, 24, 16, and 8 of this number are zero.  Call these bits
 the "holes."  Note that there is a hole just to the left of
 each byte, with an extra at the end:

 bits:  01111110 11111110 11111110 11111111
 bytes: AAAAAAAA BBBBBBBB CCCCCCCC DDDDDDDD

 The 1-bits make sure that carries propagate to the next 0-bit.
 The 0-bits provide holes for carries to fall into.  */

How does this trick of finding the null byte work?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Brendan Rius
  • 610
  • 9
  • 18
  • @WeatherVane I don't think you understand the question. The point is that testing a character string 8 bytes at a time may speed up the operation. However, it would have been less confusing if the OP didn't include his own broken code. OP: please rewrite your question purely in terms of the 32-bit version you are apparently referencing. Also, there must be some code to determine *which* byte in the word is the zero byte. – ooga Dec 05 '14 at 18:29
  • So the question is of intellectual interest? – Weather Vane Dec 05 '14 at 18:34
  • @WeatherVane A little. ;-) – ooga Dec 05 '14 at 18:42
  • Yes the question is of intellectual interest. Also, I did not include my "own broken code" (original code: https://github.com/lattera/glibc/blob/master/string/strlen.c). Anyway, I'll make the code purely 32-bit if you want – Brendan Rius Dec 05 '14 at 18:48
  • Related: [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895) - it only uses this portable C fallback on ISA where it doesn't have hand-written asm. Some use this bithack in asm, others (like x86-64) use SIMD. – Peter Cordes Jun 08 '22 at 21:37

1 Answers1

8

From the famous "Bit Twiddling Hacks" page By Sean Eron Anderson, a description of what is currently used in the glibc implementation you're referring to (Anderson calls the algorithm hasless(v, 1)):

The subexpression (v - 0x01010101UL), evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80. The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80). Finally, by ANDing these two sub-expressions the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second.

It appears that the comment(s) in the glibc source is confusing because it doesn't apply to what the code is actually doing anymore - it's describing what would have been an implementation of the algorithm that Anderson describes just before the hasless(v, 1) algorithm is described.

Michael Burr
  • 333,147
  • 50
  • 533
  • 760
  • Interestingly, another comment in the code says that the condition may be true even when there is no zero byte. `/* Which of the bytes was the zero? If none of them were, it was a misfire; continue the search. */` – ooga Dec 05 '14 at 18:50
  • Interesting! Thanks for the info. – ooga Dec 05 '14 at 18:58
  • Well yes it's interesting stuff, but you have a trade-off overhead when you find a string terminator and then need to split the 4 or 8 byte int, reorganise and continue. I've tried *various* techniques over the years of one kind or another, which are supposed to speed *other* things up, but usually they are not worth the candle if you can find another way to grease a bottleneck, and by the time I had it all properly documented for my successor so that the code was readable and maintainable, processor speeds had doubled ;-) – Weather Vane Dec 05 '14 at 19:01