I'm asking what the best way is to shift the characters in a string right or left in x86 assembly using the Irvine library. For example: ABCD --> DABC, and so on.

I've written this code, but it gives me the wrong result.

r1:
push ecx
mov ecx,lengthof arr
mov al,[esi+lengthof arr]
mov bl,[esi]
mov [esi],al
mov [esi+1],bl
inc esi
innr1:
mov al,[esi]
mov bl,[esi+1]
mov [esi],al
inc esi
loop innr1
pop ecx
loop r1
Zeyad Etman
  • The loop `innr1:` does nothing but load `al` and store it where it came from, and `bl` is ignored. – Weather Vane Dec 24 '16 at 08:16
  • Also (unless `lengthof arr` is the last *index* and not the *length*) you have an off-by-one error both in the first load and in the loop counter. – Weather Vane Dec 24 '16 at 08:20
  • If `innr1` did something like `str[esi+1] = str[esi]`, it would overwrite either the whole string or at least the second char, because `mov [esi+1],bl` overwrites a char that has not been read yet, so it is lost forever. BTW, what debugger do you use? Did you really write this much code without even checking whether the first `mov al,[esi+...]` loads the last char? It must be hard to code in assembly like that; respect, but it looks a bit like masochism. You also don't show the definition of `arr`. IMO that's a misunderstanding of how programming in assembly works: data first, code is secondary. – Ped7g Dec 24 '16 at 11:17

1 Answer

In the special case of a 4-byte string like your example, use `rol dword ptr [arr], 8` to do the rotate you described.

(Remember that x86 is little-endian, so left-shifts inside a multi-byte operand move bytes to higher addresses).
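To see why a left rotate of the dword moves the last character to the front, here is a small C sketch that emulates the instruction on a 4-byte buffer. It is written in C only so it can be checked easily; the function name `rotate4_right` is made up for illustration, and the dword is assembled byte by byte the way a little-endian x86 load would see it.

```c
#include <stdint.h>

/* Hypothetical helper: rotate a 4-byte buffer right by one character
   (ABCD -> DABC) by emulating `rol dword ptr [arr], 8` on little-endian x86. */
void rotate4_right(unsigned char *p)
{
    /* Little-endian load: the byte at the lowest address is least significant. */
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);

    v = (v << 8) | (v >> 24);              /* rol v, 8 */

    /* Little-endian store back to the same four bytes. */
    p[0] = (unsigned char)(v & 0xFF);
    p[1] = (unsigned char)((v >> 8) & 0xFF);
    p[2] = (unsigned char)((v >> 16) & 0xFF);
    p[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

The top byte (the character at the highest address) wraps around into the low byte, which is the lowest address, so the last character ends up first.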

In non-special cases, just implement a memmove() with a normal copy loop to shift bytes over, and copy the byte that has to wrap around. (You might want to load the byte that wraps around before entering the copy loop, so you can overwrite the location where it was stored.)
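As a rough C sketch of that approach (the function names `rotate_right1` and `rotate_left1` are hypothetical, and this is only one way to arrange it): save the byte that wraps around, let `memmove()` do the overlapping copy, then store the saved byte in the slot that was freed. The same three steps map onto a load, a byte copy loop, and a store in assembly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rotate len bytes right by one, "ABCDE" -> "EABCD". */
void rotate_right1(unsigned char *buf, size_t len)
{
    if (len < 2)
        return;                          /* nothing to move */
    unsigned char last = buf[len - 1];   /* save the wrap-around byte first */
    memmove(buf + 1, buf, len - 1);      /* overlapping copy toward higher addresses */
    buf[0] = last;                       /* wrapped byte goes to the front */
}

/* Hypothetical helper: rotate len bytes left by one, "ABCDE" -> "BCDEA". */
void rotate_left1(unsigned char *buf, size_t len)
{
    if (len < 2)
        return;
    unsigned char first = buf[0];        /* save the wrap-around byte first */
    memmove(buf, buf + 1, len - 1);      /* overlapping copy toward lower addresses */
    buf[len - 1] = first;                /* wrapped byte goes to the end */
}
```

Loading the wrap-around byte before the copy is what makes it safe to overwrite its old location during the copy.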


The best way to do this (for performance) is probably with SSE `movups`. `rep movsb` has high startup overhead and is slower on misaligned data. It probably also doesn't work well with overlapping destinations, but I don't remember seeing that mentioned.

If that wasn't what you meant by "best", be more specific and say "easiest to understand" or something.

Peter Cordes
  • `MOVS` was *designed* to work on overlapping data too, via DF, so for the OP's ABCD -> DABC case he would have to point `esi/edi` at the end of the buffer and use `STD` to make `MOVS` go backwards. The highly optimized `memmove(medium/large_size)` of the 80386 era had a prologue that moved a few bytes to reach 4B alignment, then `MOVSD` to do the bulk of the job, and an epilogue to finish the remaining %4 bytes. Plus it had two branches, of course, depending on the overlap direction. – Ped7g Dec 24 '16 at 11:10
  • @Ped7g: Thanks for describing how to achieve correctness with `movsb` / `movsd`. I wouldn't want to bet on it being high performance, though, when the dest is within 16B of the source. When they're distant, the optimized microcode implementation copies 16B at a time. And unless you're rotating by 16, either the src or dst must be unaligned, which `rep movsb` doesn't like (lower throughput as well as higher startup overhead for Intel's implementation.) – Peter Cordes Dec 25 '16 at 22:59
  • @Ped7g: To achieve correctness and high performance with SSE, I guess you can just make sure you go in the direction that makes overlap go the way you want. I haven't thought it through, but you might need to save the part that "wraps around" in regs or a tmp buffer during the copy loop. Or you could pipeline the copy loop's loads/store so you have at least a couple registers of data in flight at once. This lets you do 16B unaligned loads from parts of the buffer that haven't been touched by (aligned) stores yet. Exit the loop at the wrap-around point, and store the in-flight vectors. – Peter Cordes Dec 25 '16 at 23:08
  • @Ped7g: BTW, one of the main reasons I wouldn't bet on `rep movsd` to work efficiently with nearby src/dst is that Intel's implementation [uses a cache protocol that avoids RFO for the stores](http://stackoverflow.com/a/33905887/224132), so it's optimized for writing every byte in a cache line without reading it first. (This is since the original P6, not just the IvyBridge weakly-ordered-stores ERMSB enhancement). – Peter Cordes Dec 26 '16 at 00:01
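The copy-direction point from these comments can be illustrated with a small C sketch (both function names are hypothetical, and plain byte loops stand in for `movsb` with DF set or clear). When every byte moves one address up, copying from the end downward preserves the data, while copying from the start upward smears the first byte over everything.

```c
#include <stddef.h>

/* Hypothetical helper: shift len bytes up by one address, copying from the
   last byte down (the order `std` + `rep movsb` would use).
   The caller must own buf[len] as well. */
void shift_up_backward(unsigned char *buf, size_t len)
{
    for (size_t i = len; i > 0; i--)
        buf[i] = buf[i - 1];     /* each source byte is read before it is overwritten */
}

/* Hypothetical helper: the same shift, copying from the first byte up.
   This is the wrong direction for this overlap: buf[0] is replicated. */
void shift_up_forward(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i + 1] = buf[i];     /* overwrites a byte that has not been read yet */
}
```

With a buffer holding "ABCD" plus one spare byte, the backward version leaves "AABCD" (ready for the wrapped byte to be stored at index 0), while the forward version leaves "AAAAA".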