0

GNU libc has this implementation of strcspn(), which, quote,

Returns the length of the maximum initial segment of S which contains no characters from REJECT.

It's clever in its implementation in that it creates a 256-entry lookup table to make the operation of finding whether a character is in the reject set a simple O(1) array lookup, as opposed to e.g. the OpenBSD implementation which uses nested loops.

However, I'm curious how the c0/c1/c2/c3 loop below can access memory past the apparent end of str without a page fault or similar.

Does the C standard, for instance, guarantee that all strings, whether on the stack or heap or statically allocated, have their allocations aligned up to 4 bytes, so it's safe to access up to 3 past the last NUL?

I've added some comments in the code below; the original (as linked above) has none whatsoever.

/* Align a pointer by rounding down to closest size. */
#define PTR_ALIGN_DOWN(base, size) ...

size_t strcspn (const char *str, const char *reject)
{
  if ((reject[0] == '\0') || (reject[1] == '\0'))
    return __strchrnul (str, reject [0]) - str;

  // Build a lookup table containing all characters from `reject`;
  // due to the way the `do/while` loop below is constructed, `table`
  // will end up having `table[0] = 1` always, which works as an
  // exit condition.

  unsigned char p[256] = {0};
  unsigned char *s = (unsigned char*) reject;
  unsigned char tmp;
  do
    p[tmp = *s++] = 1;
  while (tmp);

  // Check the first 4 bytes "by hand".
  s = (unsigned char*) str;
  if (p[s[0]]) return 0;
  if (p[s[1]]) return 1;
  if (p[s[2]]) return 2;
  if (p[s[3]]) return 3;

  // Align the pointer (for whichever reason?)
  s = (unsigned char *) PTR_ALIGN_DOWN (s, 4);


  unsigned int c0, c1, c2, c3;
  do
    {
      s += 4;  // Loop over 4 characters at a time (the first 4 bytes were checked before)
      c0 = p[s[0]];
      c1 = p[s[1]];
      c2 = p[s[2]];
      c3 = p[s[3]];
    }
  // break if any of c0, c1, c2, or c3 is nonzero,
  // i.e. if we've either found one of the characters in `reject`, or a zero
  while ((c0 | c1 | c2 | c3) == 0);

  size_t count = s - (unsigned char *) str;
  // figure out the actual offset based on which of c0..3 is set
  return (c0 | c1) != 0 ? count - c0 + 1 : count - c2 + 3;
}
AKX
  • 152,115
  • 15
  • 115
  • 172

1 Answers1

4

That loop is preceded by the line:

s = (unsigned char *) PTR_ALIGN_DOWN (s, 4);

Which ensures that s is aligned to 4 bytes. The loop then reads 4 bytes at a time. This guarantees that even if it reads past the end of the string, the extra bytes read will still be within the same 4-byte word.

If you go strictly by the C standard, this results in undefined behavior. But glibc assumes that on the platforms it runs on, the page size is a multiple of 4 bytes, and that reading past the end of a buffer will not cause any issues as long as it's within the same page.

interjay
  • 107,303
  • 21
  • 270
  • 254
  • Sounds plausible! I wonder if that assumption is documented anywhere. – AKX May 28 '19 at 12:14
  • 2
    Note that this assumption is practically a given: Memory pages are generally large (4kiB), always aligned to their natural size, and always contain a power of 2 bytes. This follows directly from the way page-based addressing works. So you could just as easily read an entire cacheline (64 Bytes = 512 Bits = one AVX-512 register) from an aligned address and be fine: Either all 64 Bytes are mapped, or none, there's nothing in between. – cmaster - reinstate monica May 28 '19 at 12:22
  • 2
    About the only time this would fail is if running a very aggressive memory access checker, such as valgrind, which needs to be told if the function can be safely ignored. – Gem Taylor May 28 '19 at 13:03
  • 1
    @GemTaylor That's true. I think valgrind provides alternative implementations for glibc functions like `strcspn` in order to avoid this issue. – interjay May 28 '19 at 13:24