Why is this version of strrev faster than mine?

Question

I can't read assembly code, so my assumptions may be completely wrong!

Here's my code :

void reverse(char* str)
{
    size_t size = strlen(str) / 2;
    char tmp;
    for (int i = 0; i < size; ++i)
    {
        tmp = str[size - i - 1];
        str[size - i - 1] = str[size + i];
        str[size + i] = tmp;
    }
}

And here's the asm output :

000000000000073a <reverse>:
 73a:   55                      push   %rbp
 73b:   48 89 e5                mov    %rsp,%rbp
 73e:   48 83 ec 20             sub    $0x20,%rsp
 742:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
 746:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 74a:   48 89 c7                mov    %rax,%rdi
 74d:   e8 9e fe ff ff          callq  5f0 <strlen@plt>
 752:   48 d1 e8                shr    %rax
 755:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 759:   c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%rbp)
 760:   eb 72                   jmp    7d4 <reverse+0x9a>
 762:   8b 45 f4                mov    -0xc(%rbp),%eax
 765:   48 98                   cltq   
 767:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
 76b:   48 29 c2                sub    %rax,%rdx
 76e:   48 89 d0                mov    %rdx,%rax
 771:   48 8d 50 ff             lea    -0x1(%rax),%rdx
 775:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 779:   48 01 d0                add    %rdx,%rax
 77c:   0f b6 00                movzbl (%rax),%eax
 77f:   88 45 f3                mov    %al,-0xd(%rbp)
 782:   8b 45 f4                mov    -0xc(%rbp),%eax
 785:   48 63 d0                movslq %eax,%rdx
 788:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 78c:   48 01 c2                add    %rax,%rdx
 78f:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 793:   48 01 d0                add    %rdx,%rax
 796:   8b 55 f4                mov    -0xc(%rbp),%edx
 799:   48 63 d2                movslq %edx,%rdx
 79c:   48 8b 4d f8             mov    -0x8(%rbp),%rcx
 7a0:   48 29 d1                sub    %rdx,%rcx
 7a3:   48 89 ca                mov    %rcx,%rdx
 7a6:   48 8d 4a ff             lea    -0x1(%rdx),%rcx
 7aa:   48 8b 55 e8             mov    -0x18(%rbp),%rdx
 7ae:   48 01 ca                add    %rcx,%rdx
 7b1:   0f b6 00                movzbl (%rax),%eax
 7b4:   88 02                   mov    %al,(%rdx)
 7b6:   8b 45 f4                mov    -0xc(%rbp),%eax
 7b9:   48 63 d0                movslq %eax,%rdx
 7bc:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 7c0:   48 01 c2                add    %rax,%rdx
 7c3:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 7c7:   48 01 c2                add    %rax,%rdx
 7ca:   0f b6 45 f3             movzbl -0xd(%rbp),%eax
 7ce:   88 02                   mov    %al,(%rdx)
 7d0:   83 45 f4 01             addl   $0x1,-0xc(%rbp)
 7d4:   8b 45 f4                mov    -0xc(%rbp),%eax
 7d7:   48 98                   cltq   
 7d9:   48 39 45 f8             cmp    %rax,-0x8(%rbp)
 7dd:   77 83                   ja     762 <reverse+0x28>
 7df:   90                      nop
 7e0:   c9                      leaveq 
 7e1:   c3                      retq

And here's the other version:

void strrev2(unsigned char *str)
{
    int i;
    int j;
    unsigned char a;
    unsigned len = strlen((const char *)str);
    for (i = 0, j = len - 1; i < j; i++, j--)
    {
        a = str[i];
        str[i] = str[j];
        str[j] = a;
    }
}

And the asm:

00000000000007e2 <strrev2>:
 7e2:   55                      push   %rbp
 7e3:   48 89 e5                mov    %rsp,%rbp
 7e6:   48 83 ec 20             sub    $0x20,%rsp
 7ea:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
 7ee:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 7f2:   48 89 c7                mov    %rax,%rdi
 7f5:   e8 f6 fd ff ff          callq  5f0 <strlen@plt>
 7fa:   89 45 fc                mov    %eax,-0x4(%rbp)
 7fd:   c7 45 f4 00 00 00 00    movl   $0x0,-0xc(%rbp)
 804:   8b 45 fc                mov    -0x4(%rbp),%eax
 807:   83 e8 01                sub    $0x1,%eax
 80a:   89 45 f8                mov    %eax,-0x8(%rbp)
 80d:   eb 4d                   jmp    85c <strrev2+0x7a>
 80f:   8b 45 f4                mov    -0xc(%rbp),%eax
 812:   48 63 d0                movslq %eax,%rdx
 815:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 819:   48 01 d0                add    %rdx,%rax
 81c:   0f b6 00                movzbl (%rax),%eax
 81f:   88 45 f3                mov    %al,-0xd(%rbp)
 822:   8b 45 f8                mov    -0x8(%rbp),%eax
 825:   48 63 d0                movslq %eax,%rdx
 828:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 82c:   48 01 d0                add    %rdx,%rax
 82f:   8b 55 f4                mov    -0xc(%rbp),%edx
 832:   48 63 ca                movslq %edx,%rcx
 835:   48 8b 55 e8             mov    -0x18(%rbp),%rdx
 839:   48 01 ca                add    %rcx,%rdx
 83c:   0f b6 00                movzbl (%rax),%eax
 83f:   88 02                   mov    %al,(%rdx)
 841:   8b 45 f8                mov    -0x8(%rbp),%eax
 844:   48 63 d0                movslq %eax,%rdx
 847:   48 8b 45 e8             mov    -0x18(%rbp),%rax
 84b:   48 01 c2                add    %rax,%rdx
 84e:   0f b6 45 f3             movzbl -0xd(%rbp),%eax
 852:   88 02                   mov    %al,(%rdx)
 854:   83 45 f4 01             addl   $0x1,-0xc(%rbp)
 858:   83 6d f8 01             subl   $0x1,-0x8(%rbp)
 85c:   8b 45 f4                mov    -0xc(%rbp),%eax
 85f:   3b 45 f8                cmp    -0x8(%rbp),%eax
 862:   7c ab                   jl     80f <strrev2+0x2d>
 864:   90                      nop
 865:   c9                      leaveq 
 866:   c3                      retq

Why is the second version faster (I assume it is, because there are less instructions) and why does objdump produce more assembly instructions for my code?

My code uses less memory, but I thought it would also be faster, because I only increment one variable (i) and I don't cast when using strlen().

Ext3h · Accepted Answer · 2020-08-20T18:51:30.693

That piece here: size - i - 1

That is ruining the performance for you, as that calculation is actually being performed every single loop iteration.

Your assumption about using "less memory" is wrong. These variables didn't even end up in memory, in neither of the algorithms, but were kept purely within registers. So there was no memory access to eliminate in the first place, the only thing your optimization achieved was to introduce additional arithmetic which is now slowing down the loop.

The most complex form of addressing x86 arch can handle in a single instruction is variable[variable + constant]. Any more complex than that, and the pointer arithmetic has to be performed with multiple instructions instead.

Also, the compiler unrolled the code, correctly estimating the effects of up to 3 iterations in a row. For the code with i and j that means incrementing only once every 3 iterations, and using constant offsets in between. For your code, it meant redoing the address calculation over and over again.

Yes, usually they are. In fact, pointers are about the biggest data type you can expect to be kept in registers *unconditionally*, until you are running out of then. Everything bigger than a pointer (e.g. 128bit types) only fits into special registers, of which there are significantly fewer. And if it's a `struct` it's going to end up in memory for certain, unless the compiler did manage to rip it apart during optimization. — Ext3h, Aug 20 '20 at 19:14

score 1 · Answer 2 · answered Aug 20 '20 at 18:54

The statement i++ and j++ can be translated to one assembly instruction that increment a register by 1.

When you do arithmetic indexing, it has to load size to register, subtract it with i and write to another register. There are 4 such operations within the while loop.

Vlad from Moscow · Answer 3 · 2020-08-20T18:52:41.283

The both functions are bad and wrong.

For example the first function does not work correctly with strings that have an odd value of the length.

Here is a demonstrative program.

#include <stdio.h>
#include <string.h>

void reverse(char* str)
{
    size_t size = strlen(str) / 2;
    char tmp;
    for (int i = 0; i < size; ++i)
    {
        tmp = str[size - i - 1];
        str[size - i - 1] = str[size + i];
        str[size + i] = tmp;
    }
}

int main(void) 
{
    char s[] = "123";
    
    reverse( s );
    
    puts( s );
    
    return 0;
}

The program output is

In the function there are mixed the types int and size_t that can result in an infinite loop.

In the second function there is used incorrectly the type unsigned int instead of the type size_t and again there are mixed the types int and unsigned int.

void strrev2(unsigned char *str)
{
    int i;
    int j;
    unsigned char a;
    unsigned len = strlen((const char *)str);
    for (i = 0, j = len - 1; i < j; i++, j--)
    {
        a = str[i];
        str[i] = str[j];
        str[j] = a;
    }
}

So the both functions are very bad written.

And the functions should be declared like

char * reverse( char * );

So there is no great sense to compare which bad function is faster.:)

I think such a function usually is written using an assembler.

Using C I would write the function the following way as it is shown in the demonstrative program below.

#include <stdio.h>
#include <string.h>

char * reverse( char * s )
{
    if ( *s )
    {
        for ( char *p = s, *q = s + strlen( s ); p < --q; ++p )
        {
            char c = *p;
            *p = *q;
            *q = c;
        }
    }
    
    return s;
}

int main(void) 
{
    char s[] = "123";
    
    puts( reverse( s ) );
    
    return 0;
}

@Ext3h ehm, it certainly does have `size_t`, and so does `ptrdiff_t`. https://port70.net/~nsz/c/c89/c89-draft.html - and why did `strcpy` return the target... — Antti Haapala -- Слава Україні, Aug 20 '20 at 19:13
@Ext3h You are mistaken. It is a common convention of string standard functions to return pointer to the result string. The second version is just bad. Mixing signed int and unsigned int can result in invalid loop. Only low-qualified programmers can call this version ideal.:) I have nothing to append. — Vlad from Moscow, Aug 20 '20 at 19:15

Support Ukraine · Answer 4 · 2020-08-21T06:17:12.437

To start with: If you want to compare anything, you need to make sure that you compare two pieces of code that behave the same. Anyway...

Why is the linux version faster(I assume it is, because there are less instructions)

You can't just count the number of instructions and conclude that the one with less instructions is the fastest.

Just like C code there can be loops in assembly code.

For instance one piece of assembly may loop 100 times over the same 3 instructions and another piece (doing the same) may have unrolled the loop to (e.g.) 200 instructions without any loop.

So even if the second has way more instruction, it may still be significantly faster.

There are many other reasons that you can't just compare assembly code to find the fastest piece of code. Several advanced features exist at at hw-level, e.g. branch prediction, cache effects, out-of-order execution, instruction inter-dependencies impacting pipeline stalls, etc. How such things impacts the execution time of a specific piece of code is something only "extreme experts in the specific processor/system" can judge solely by looking at assembly code. If your not an "extreme expert" the only good way to find the fastest piece of code is to measure execution time.

Brett Hale · Answer 5 · 2020-08-20T19:51:22.513

0

Keep it simple, and avoid any explicit indexing:

#include <string.h>

...

void my_strrev (char *str)
{
    char *rev = str + strlen(str) - 1;

    while (str < rev)
    {
        char ci = *str, cj = *rev;
        *str++ = cj, *rev-- = ci; /* (exchange) */
    }
}

Pointer comparison is well-defined here, as they are both addresses of elements in the same 'array' (or contiguous memory region). This yields a tight loop that fits within the instruction cache, and is easy to understand. Also, I'd recommend using -O2 for any real profiling.

edited Aug 20 '20 at 19:51

answered Aug 20 '20 at 19:40

Brett Hale

21,653
2
61
90

When compiled with -O2 , the output of your version is the same as VladFromMoscow's. You may want to check this one out. https://stackoverflow.com/questions/8145449/how-come-my-array-index-is-faster-than-pointer. I don't know if the answers there are correct, but they state that never ever pointers are supposed to be faster than array index. – S.Sot Aug 20 '20 at 19:49
@S.Sot - both approaches are dereferencing a pointer. This version avoids the indexing form. It may not be any faster, but would certainly not be any slower. – Brett Hale Aug 20 '20 at 19:55

Why is this version of strrev faster than mine?

5 Answers5