What is a practical application of the x86 RCL/RCR instructions?

Question

I'm interested in practical applications, even if they are outdated by modern standards.

There's a similar question about ROL and ROR here, but it doesn't really cover RCL/RCR.

I can come up with some applications for RCL/RCR with an operand of 1 (e.g. for some LFSRs), but I can't think of any sensible application with a non-1 operand.

So can anyone enlighten me?

P.S. Sample code is more than welcome.

Update 1: as Peter Cordes mentioned in the comments below, one (quite obvious) application is emulating shrd/shld. (IIRC rcl/rcr instructions were already in the 8080.)

Maybe 'non 1' above was not clear, but mind that I'm mostly interested in usages where the operand is != 1 (RC(L|R) REG, c with c being either > 1 or == cl).

Shifting bits between registers one-at-a-time was one use-case, before `shrd` / `shld` existed. Or doing a 32-bit rotate across two 16-bit registers. — Peter Cordes, Apr 05 '19 at 21:38
You put x86-64 as a tag which triggered this comment. The rcl/rcr come from the 8-bit era and are still there in x86 due to backward compatibility requirements. I find the question interesting and will try to come up with something tomorrow. I worked quite a lot with z80 (which has similar instructions). — tum_, Apr 05 '19 at 22:56
@PeterCordes - for others here, note that `shrd` / `shld` only updates the destination register (with bits from source register shifted in), leaving the source register unshifted. Although this means a second instruction in the case of a two register (or 2 memory) shift, it is useful for an extended precision shift in the case of a more than 2 register (or 2 memory) shift. — rcgldr, Apr 06 '19 at 00:19
In addition to Peter Cordes' answer: There may still be applications where a long value (e.g. a 320 bit number) must be shifted. In this case you shift the first 32 (or 64) bits using `shl` (or `shr` or `sar`) and the remaining N*32 (or N*64) bits using N `rcl` (or `rcr`) operations. — Martin Rosenau, Apr 06 '19 at 06:46
@MartinRosenau: If you're shifting by only 1 bit, then yes that's worth considering. Otherwise SSE2 or better AVX2 shuffles + shifts are clearly a better bet. e.g. GMP's SSE2 version of [`mpn_lshift`](https://gmplib.org/manual/Low_002dlevel-Functions.html) uses `psllq` / `psrlq` / `por`, and `punpcklqdq`. https://gmplib.org/repo/gmp/file/tip/mpn/x86_64/fastsse/lshift.asm, with various cases depending on alignment. (Of course if we're not talking about x86-64, then SSE2 might not be available. And variable-count `shld` is not super fast on Intel; GMP doesn't seem to use it.) — Peter Cordes, Apr 06 '19 at 07:29
Yes, rcl/rcr by implicit 1 and by CL [existed in 8086](http://www.posix.nl/linuxassembly/nasmdochtml/nasmdoca.html), presumably for consistency with the encoding of other shifts. Then 286 added the imm8 count versions of all shifts, including these. Then 386 added `shld`/`shrd` by imm8 or by CL. — Peter Cordes, Apr 06 '19 at 09:24
I'm not aware of any simple use-case for RCL/RCR with count other than 1. Sorry I missed that you were asking about count != 1 uses, that's a really good question :P. It might well be *just* for consistency of machine encoding and so they're not a special case in the decoders. — Peter Cordes, Apr 06 '19 at 09:26
yeah, consistency is my guess as well, that's why i'm interested if there's some known sensible usage — GiM, Apr 06 '19 at 09:33
yeah, a *known* sensible usage of "rotating more than a single bit through carry" is not an easy task to find, indeed. The closest approximation I can think of is a hypothetical case where you have something that requires exactly 9 bits (0-511 range) to be represented and you need to rotate it for whatever purpose. Then the opcode seems to be the perfect fit :) — tum_, Apr 06 '19 at 11:43
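The 9-bit idea in the comment above can be checked with a small model. This is only a sketch in Python, not real x86 code: `rcl8` and `rotl9` are hypothetical helper names, and the count handling (masked to 5 bits, then taken modulo 9 for 8-bit operands) is an assumption based on Intel's documented behaviour for 80186 and later CPUs.

```python
def rcl8(x, cf, count):
    """Model of `rcl r8, count`, applied one bit at a time."""
    for _ in range((count & 0x1F) % 9):
        # Old CF enters at bit 0, old bit 7 becomes the new CF.
        x, cf = ((x << 1) | cf) & 0xFF, (x >> 7) & 1
    return x, cf


def rotl9(v, count):
    """Plain left rotate of a 9-bit value."""
    count %= 9
    return ((v << count) | (v >> (9 - count))) & 0x1FF


# Treat CF as bit 8 on top of the 8-bit register and rotate the 9-bit value.
x, cf, count = 0b1010_0011, 1, 3
v = rotl9((cf << 8) | x, count)
print(rcl8(x, cf, count) == (v & 0xFF, v >> 8))  # prints True
```

In this model a multi-bit `rcl` on an 8-bit register is exactly a rotate of the 9-bit quantity CF:register, which is why it only fits data that happens to be 9 (or 17, 33, 65) bits wide.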

Martin Rosenau · Answer 1 · 2019-04-06T08:44:32.350

In shifting operations, these instructions play the same role as the add-with-carry (adc) and subtract-with-borrow (sbb) instructions do in additions and subtractions:

They are used as the second (and every following) instruction when a number is longer than a CPU register, so that it has to be processed using multiple operations.

Example: On a 386 CPU you can perform 32-bit operations using a single instruction. However, you might want to process 320-bit integer numbers.

Let's say we have a 4-bit CPU and we want to perform an "arithmetic right shift" (sar) operation on a 16-bit integer number:

Integer: ABCDEFGHIJKLMNOP  (A-P = some bits that may be 1 or 0)

Operation on a 16 bit CPU:

    ABCDEFGHIJKLMNOP (SAR 1) -> AABCDEFGHIJKLMNO, CF = P

Operation on a 4 bit CPU:

    ABCD (SAR 1) -> AABC, CF = D
    EFGH, CF = D (RCR 1) -> DEFG, CF = H
    IJKL, CF = H (RCR 1) -> HIJK, CF = L
    MNOP, CF = L (RCR 1) -> LMNO, CF = P

So the final result on the 4-bit CPU is AABCDEFGHIJKLMNO, CF = P

Of course the same example would work with a 256-bit number on a 64-bit CPU...
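The sequence above can be modelled in a few lines of Python. This is a sketch under stated assumptions rather than real machine code: `sar1`, `rcr1` and `multiword_sar1` are hypothetical helper names, the carry flag is passed around as an ordinary value, and the words are stored most significant first, as in the example.

```python
def sar1(word, bits):
    """Arithmetic shift right by 1: returns (result, carry_out)."""
    sign = word & (1 << (bits - 1))      # the top bit is kept in place
    return (word >> 1) | sign, word & 1


def rcr1(word, carry_in, bits):
    """Rotate right through carry by 1: returns (result, carry_out)."""
    return (word >> 1) | (carry_in << (bits - 1)), word & 1


def multiword_sar1(words, bits):
    """Shift a multi-word number right by 1; words[0] is the most significant."""
    first, cf = sar1(words[0], bits)
    out = [first]
    for w in words[1:]:
        w, cf = rcr1(w, cf, bits)        # CF carries the bit into the next word
        out.append(w)
    return out, cf


# 0xB6D3 split into four 4-bit words; the 16-bit result should be 0xDB69, CF = 1.
result, cf = multiword_sar1([0xB, 0x6, 0xD, 0x3], bits=4)
print([hex(w) for w in result], cf)      # ['0xd', '0xb', '0x6', '0x9'] 1
```

Calling the same function with `bits=64` and four words models the 256-bit case on a 64-bit CPU.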

Please also note:

Using add/adc, sub/sbb or shl/rcl, we start at the low bits and continue with the high bits. However, using shr/rcr or sar/rcr, it is the other way round.
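For comparison, here is the same kind of Python model for the left-shift direction, where the chain starts at the least significant word. Again, this is only a sketch: `shl1`, `rcl1` and `multiword_shl1` are hypothetical names, and the word list is assumed to be ordered least significant first.

```python
def shl1(word, bits):
    """Logical shift left by 1: returns (result, carry_out)."""
    mask = (1 << bits) - 1
    return (word << 1) & mask, (word >> (bits - 1)) & 1


def rcl1(word, carry_in, bits):
    """Rotate left through carry by 1: returns (result, carry_out)."""
    mask = (1 << bits) - 1
    return ((word << 1) | carry_in) & mask, (word >> (bits - 1)) & 1


def multiword_shl1(words_low_first, bits):
    """Shift a multi-word number left by 1, starting at the lowest word."""
    first, cf = shl1(words_low_first[0], bits)
    out = [first]
    for w in words_low_first[1:]:
        w, cf = rcl1(w, cf, bits)        # CF carries the bit into the next word
        out.append(w)
    return out, cf


# 0xB6D3 as 4-bit words, lowest first; the 16-bit result should be 0x6DA6, CF = 1.
result, cf = multiword_shl1([0x3, 0xD, 0x6, 0xB], bits=4)
print([hex(w) for w in result], cf)      # ['0x6', '0xa', '0xd', '0x6'] 1
```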

Also worth mentioning that `adc x,x` is exactly equivalent to `rcl x, 1` as far as reading/setting CF, but faster. (`rcl rax,1` is a 3 uop instruction on Skylake, but `adc rax,rax` is single-uop https://agner.org/optimize/. Rotate-by-1 sets extra flags, but not *all* flags, so it decodes to a flag-merging uop. Variable-count `rcl` is even slower, but wouldn't have many use-cases even if it was fast, AFAIK.) So `rcl` is only interesting if your data is in memory. `rcr` can't be emulated that easily, though. — Peter Cordes, Apr 06 '19 at 07:47
Fun fact: on AVR (an 8-bit RISC), [`rol` is a pseudo-instruction for `adc same,same`](https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_ROL.html). (AVR rotates are always through carry.) — Peter Cordes, Apr 06 '19 at 07:49
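The equivalence of `adc x,x` and `rcl x,1` mentioned in the comments can be checked exhaustively for a small register width. The following Python sketch models only the result and the carry flag (the other flags, which differ between the two instructions, are ignored); `rcl1` and `adc_same` are hypothetical helper names.

```python
def rcl1(x, cf, bits=8):
    """Rotate left through carry by 1: returns (result, carry_out)."""
    mask = (1 << bits) - 1
    return ((x << 1) | cf) & mask, (x >> (bits - 1)) & 1


def adc_same(x, cf, bits=8):
    """Model of `adc x, x`: x + x + CF, with the overflow bit as the new CF."""
    total = x + x + cf
    return total & ((1 << bits) - 1), total >> bits


# Compare both models for every 8-bit value and both carry states.
same = all(rcl1(x, cf) == adc_same(x, cf)
           for x in range(256) for cf in (0, 1))
print(same)  # prints True
```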
