
I heard that

  • The string operations in the C standard library (e.g. `strlen()`) access and operate on the characters of a string one character at a time.
  • Computers access memory one word at a time.
  • Accessing one character at a time is therefore inefficient, and the time cost of the string operations is high.
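For concreteness, this is the kind of one-character-at-a-time loop I have in mind (my own sketch, not the actual library source):

```c
#include <stddef.h>

/* Naive strlen: examines exactly one character per loop iteration. */
size_t naive_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```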

Are the above true?

What techniques can be used to improve the time performance of string operations?

– Tim
  • "Computers access a word at a time" is a pretty serious oversimplification these days. Vector units access chunks substantially larger than the machine's native word size. And if the data isn't in cache, then it gets fetched a *cache line* at a time. – Nate Eldredge Sep 28 '20 at 02:11
  • C string operations are inefficient because of [Schlemiel the Painter's algorithm](https://www.joelonsoftware.com/2001/12/11/back-to-basics/) – Antti Haapala -- Слава Україні Sep 28 '20 at 04:06
  • "Schlemiel" often is an issue in C code that manipulates null-terminated strings. However, my experience is that it's nearly always possible to store the length of the string so it doesn't need to be recalculated. Normally I'd wrap up the string with its length and some other commonly used information in a `struct`, and have code that operates only on those structs. The efficiency or otherwise of determining the length usually becomes a non-issue. (A sketch of this approach appears after the comment thread.) – Kevin Boone Sep 28 '20 at 07:28
  • @KevinBoone: Right, so a better example would be `strcpy` vs. `memcpy` or `strcmp` vs. `memcmp`; yes, the explicit-length `mem` functions can typically be a little faster, even for runtime-variable lengths. (Both use SIMD on ISAs like x86, but `str` functions need more ALU instructions to find the ends of strings on the fly. Both `mem` and `str` functions are of course defined in terms of operating one byte at a time, but that's why we have the as-if optimization rule.) – Peter Cordes Sep 30 '20 at 11:56
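A minimal sketch of the length-carrying struct Kevin Boone describes; the names (`lstring`, `ls_append`) and the capacity field are made up here for illustration, not taken from the thread:

```c
#include <stddef.h>
#include <string.h>

/* Carry the length (and capacity) alongside the data so it never has to be
   recomputed by rescanning the string. */
struct lstring {
    char   *data;  /* NUL-terminated, so it still interoperates with plain C APIs */
    size_t  len;   /* number of characters before the terminator */
    size_t  cap;   /* allocated size of data, including room for the terminator */
};

/* Append without rescanning. strcat() would walk dst->data from the start on
   every call ("Schlemiel the Painter"); here we jump straight to the end.
   Assumes the caller has already ensured dst->cap > dst->len + srclen. */
static void ls_append(struct lstring *dst, const char *src, size_t srclen)
{
    memcpy(dst->data + dst->len, src, srclen);
    dst->len += srclen;
    dst->data[dst->len] = '\0';
}
```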

1 Answer


The assumption in the question is false. Optimized implementations of strlen and other string operations in fact work word-at-a-time.

The GNU C Library ("glibc") has hand-optimized assembly routines for this, including its `strlen` for x86_64.
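As a rough illustration of the word-at-a-time idea, here is a sketch of the well-known zero-byte bithack (this is not glibc's actual code; its generic C `strlen` and the hand-written asm are more elaborate):

```c
#include <stddef.h>
#include <stdint.h>

size_t wordwise_strlen(const char *s)
{
    const char *p = s;

    /* Advance byte-by-byte until p is word-aligned, so the whole-word loads
       below stay within one aligned word (and therefore within one page). */
    while ((uintptr_t)p % sizeof(unsigned long) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    const unsigned long ones  = (unsigned long)-1 / 0xFF;  /* 0x0101...01 */
    const unsigned long highs = ones << 7;                  /* 0x8080...80 */
    const unsigned long *w    = (const unsigned long *)p;

    /* A word x contains a zero byte iff (x - ones) & ~x & highs is nonzero. */
    while (((*w - ones) & ~*w & highs) == 0)
        w++;

    /* Pin down the exact zero byte inside the final word. */
    p = (const char *)w;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Reading character data through an `unsigned long *` is technically outside strict ISO C; as the comments below note, the library's own implementation is allowed to rely on guarantees of the platform it ships with, which portable user code cannot.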

– Kaz
  • We've also addressed this specific point [at least once before](https://stackoverflow.com/q/20312340/364696); `strlen` is typically optimized to work a word at a time, because library functions can operate in ways that are technically undefined behavior when the behavior is effectively defined on all the systems that execute them. – ShadowRanger Sep 28 '20 at 02:03
  • A string in C is a char array ended with the null character. (Why) Are the naive implementations of string operations, which operate one character at a time, inefficient? – Tim Sep 28 '20 at 02:04
  • @ShadowRanger Also, hand-written assembly code is unrelated to the definedness rules of ISO C. – Kaz Sep 28 '20 at 02:07
  • @Tim It's all about cycles. The naive implementation performs one iteration of a loop per character. That loop takes a certain number of cycles to execute: that's the cycles per character. An implementation working with words can use fewer cycles per character. – Kaz Sep 28 '20 at 02:09
  • @Tim It may be that if the string is not in the L1 cache, it doesn't matter; the speed will be determined by how fast the processor can load the memory into the cache. This type of optimization shows the most drastic differences when the data is cached. – Kaz Sep 28 '20 at 02:11
  • Do you mean that when multiple characters of a string are cached, it doesn't matter for performance whether we operate one character at a time? – Tim Sep 28 '20 at 02:22
  • Tim, I _think_ Kaz means the opposite. If the data _is_ cached, the "wide word" code will run faster. But even if the data is _not_ cached, after the cache line is fetched, the wide fetch will still execute faster than the single-char code [but the time to fetch the cache line from RAM will dwarf the execution time of the loop]. – Craig Estey Sep 28 '20 at 02:30
  • @CraigEstey: DRAM bandwidth (and hardware prefetch) can easily keep up with a CPU reading only 1 byte per clock cycle on modern CPUs (e.g. well over 30 GB/s of read bandwidth on a "client" CPU like Skylake, not a big Xeon, vs. 4 to 5 GHz = 4 to 5 GB/s on high-end CPUs running 1 byte per clock). So no, if your code sucks so much that it really does read 1 char at a time, RAM is typically not a bottleneck for long strings. Even the much lower per-core bandwidth available to a core in a big Xeon can keep up with that. – Peter Cordes Sep 30 '20 at 12:01
  • @Kaz: A better example of `strlen` would probably be x86-64 SIMD as described in [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035), not the integer bithack. But if you do want that, you can link that question, which focuses on the performance of that glibc code (although the accepted answer spends more time on possible C UB than on how it works). Also [Why is this code 6.5x slower with optimizations enabled?](https://stackoverflow.com/a/55589634) for a simple example of an x86-64 SIMD strlen (for 16-byte aligned inputs; a rough sketch of that idea appears after this comment thread). – Peter Cordes Sep 30 '20 at 12:06
  • My answer on [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739) links x86-64 glibc strlen with some description of the hand-written asm. – Peter Cordes Sep 30 '20 at 12:06
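Following up on the SIMD links above, here is a rough sketch of the SSE2 version of the same idea, under the same simplifying assumption as the linked example (a 16-byte-aligned starting pointer); the function name is made up, and `__builtin_ctz` is a GCC/Clang builtin:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Requires: s is 16-byte aligned. Real library versions also handle
   unaligned starting pointers and select the best ISA extension at run time. */
size_t strlen_sse2_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;

    for (;;) {
        __m128i chunk = _mm_load_si128((const __m128i *)p); /* aligned 16-byte load */
        __m128i eq    = _mm_cmpeq_epi8(chunk, zero);        /* 0xFF where a byte is 0 */
        int     mask  = _mm_movemask_epi8(eq);              /* one result bit per byte */
        if (mask != 0)  /* the terminator is somewhere in this 16-byte chunk */
            return (size_t)(p - s) + (size_t)__builtin_ctz((unsigned)mask);
        p += 16;
    }
}
```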