
Intel's "Optimization Reference Manual" mentions a new CPU feature, "Fast Short REP CMPSB and SCASB", that could speed up string operations:

REP CMPSB and SCASB performance is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. When the Fast Short REP CMPSB and SCASB feature is enabled, REP CMPSB and REP SCASB performance is flat 15 cycles per operation, for all strings 1-128 byte long whose two source operands reside in the processor first level cache.

Support for fast short REP CMPSB and SCASB is enumerated by the CPUID feature flag: CPUID.07H.01H:EAX.FAST_SHORT_REP_CMPSB_SCASB[bit 12] = 1.
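For reference, a minimal sketch of how that flag could be checked at runtime. It assumes a GCC or Clang toolchain (for `__get_cpuid_count` from `<cpuid.h>`); the names `fsrc_bit_set` and `cpu_has_fsrc` are hypothetical helpers made up for this illustration, not part of any Intel or compiler API.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (assumed toolchain) */
#endif

/* Bit position quoted from the manual: CPUID.07H.01H:EAX[bit 12]. */
#define FSRC_EAX_BIT 12u

/* Hypothetical helper: test the FSRC bit in an EAX value that was
   read from CPUID leaf 7, subleaf 1 (for example from a CPUID dump). */
static inline bool fsrc_bit_set(uint32_t leaf7_sub1_eax)
{
    return ((leaf7_sub1_eax >> FSRC_EAX_BIT) & 1u) != 0;
}

/* Hypothetical helper: query the running CPU. Returns false on
   non-x86 builds or when leaf 7 / subleaf 1 is not available. */
static inline bool cpu_has_fsrc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Subleaf 0 of leaf 7 reports the highest valid subleaf in EAX. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) || eax < 1)
        return false;

    if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
        return false;

    return fsrc_bit_set(eax);
#else
    return false;
#endif
}
```

The pure helper can also be applied to values copied out of published CPUID dumps: `fsrc_bit_set(0x00400810)` is false because bit 12 (`0x1000`) is clear in that value.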

The description of Fast Short REP MOVSB explicitly mentions which microarchitecture introduced it:

Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced

But I could not find any information about which CPU generation started supporting "Fast Short REP CMPSB".

Jörn Horstmann
  • https://stackoverflow.com/a/43837448/17034 – Hans Passant Feb 01 '23 at 12:58
  • Thanks, that answer describes rep movsb/rep stosb quite well, but does not seem to mention cmpsb/scasb – Jörn Horstmann Feb 01 '23 at 15:03
  • Interesting; in current Intel CPUs `rep scasb` and `rep cmpsb` are total disasters, not optimized at all in microcode, just doing one byte-load per cycle, so one compare per 1 or 2 cycles. That's unlike movs/stos, which have been fast since at least P6 "fast strings" support, at least for non-overlapping src / dst and DF=0, so even without ERMSB or Fast Short REP MOVSB they're not a total disaster on older CPUs the way `rep scasb` is: [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](//stackoverflow.com/q/55563598) (GCC fixed that by not using `rep scasb` at -O1.) – Peter Cordes Feb 01 '23 at 16:53
  • Google searching didn't find any mention of CPUs with this feature. I guess you could look at CPUID dumps for recent CPUs on http://users.atw.hu/instlatx64/ and manually check EAX bit 12 in the right leaf in case Raptor Lake has it. Unfortunately they don't seem to have a Sapphire Rapids CPU, and I didn't notice any engineering samples of upcoming Intel CPUs like I think they've sometimes had in the past. – Peter Cordes Feb 01 '23 at 17:04
  • Looks like no for Raptor Lake, `0x00400810 & (1<<12)` is zero. http://users.atw.hu/instlatx64/GenuineIntel/GenuineIntel00B0671_RaptorLake_02_CPUID.txt – Peter Cordes Feb 01 '23 at 17:10

1 Answer


The CPUID dump for the Core i5-12500 (which has only performance cores, no efficiency cores) shows support for this feature.

Dumps for 1350P and 1365U also show support.

Interestingly, I did not see it in any of the other 13x00 cores.

InstLatX64 on Twitter also pointed me to the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", which says the following:

Fast Short REP CMPSB, fast short REP SCASB: 4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture

Jörn Horstmann
  • No, none of the cores in the i5-13500 support FSRC because it's hybrid. The Intel Optimization Manual tells you in Section 3.8 that support for FSRC started in GLC, which is correct. Keep in mind that the current hybrid configurations force the architectures of the P-core and E-core microarchitectures to match by reducing them to the least common denominator feature set. – Hadi Brais Feb 06 '23 at 20:38
  • @HadiBrais: That's weird in this case, it's not a new instruction, it's just a performance feature. The different cores are already microarchitecturally different. I guess it makes sense that if a program makes tuning choices at startup, though, you'd rather pick a strategy that doesn't face-plant after migrating to the E cores. I'd guess short-rep cmps/scas might actually *be* fast on the P cores, they just don't advertise it via a feature bit on hybrid chips. (i.e. they probably wouldn't use different microcode on P cores that are part of a chip with E cores, except for CPUID.) – Peter Cordes Feb 08 '23 at 17:32
  • @PeterCordes It's a microarchitectural feature, but in the context of the hybrid design, it's treated like an architectural feature because of the way it'll normally be used, which is doing a check at startup or at the time of generating native code. After that point, there is the expectation that the checked features will always be available. This is convenient and simplifies things, but at the expense of reduced potential of the hybrid design. It may be supported without being advertised, although features untested post-silicon are normally fused off so they don't leak unknown behavior, I think. – Hadi Brais Feb 08 '23 at 19:24
  • @HadiBrais: Yeah, that's pretty much what I guessed. But this isn't the kind of feature you can "fuse off", is it? I expect it's more a matter of the microcode implementation of `rep cmpsb`, to emit a fixed number of uops with the later ones being predicated on the count, so they run as NOPs if RCX was smaller than their threshold. IIRC, [we already know the startup strategy for `rep movsb` works something like that](https://stackoverflow.com/a/33905887), even without the "fast short rep movs" feature. That would explain why OoO exec around a `rep movs` works for counts below a threshold. – Peter Cordes Feb 08 '23 at 19:45
  • @HadiBrais: The other reason for guessing that they'd keep the feature but not advertise it via CPUID is that there *are* CPUs with only P cores which do advertise the feature. Testing and validating the core with one microcode implementation is easier than validating two different implementations. And if there is any dedicated hardware to support this, it has to exist and work in some cores that are microarchitecturally the same, so fusing it off doesn't save them anything. – Peter Cordes Feb 08 '23 at 19:49
  • @PeterCordes I do think it's fully implemented in microcode, but it's possible that the microcode for GLC has both implementations with a flag (or some bits of the address of a micro branch) that can be fused off to choose one of them, depending on chip defects and binning. Chips with working and faulty microcode routines could end up in the same bin. All chips in a bin have to be "pruned" to simplify marketing. Otherwise, if only the fast implementation is present, then it'll always be enabled irrespective of the CPUID support flag. – Hadi Brais Feb 08 '23 at 22:03