Optimizing C code for ARM-based devices

Question

Recently, I've stumbled upon an interview question where you need to write a code that's optimized for ARM, especially for iphone:

Write a function which takes an array of char (ASCII symbols) and find the most frequent character.

char mostFrequentCharacter(char* str, int size)

The function should be optimized to run on dual-core ARM-based processors, and an infinity amount of memory.

On the face of it, the problem itself looks pretty simple and here is the simple implementation of the function, that came out in my head:

#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char current_char;
    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(size_t i = 0; i<size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }

    return frequent_char;
}

Firstly, I did some basic code optimization; I moved the code, that calculates the most frequent char every iteration, to an additional for loop and got a significant increase in speed, instead of evaluating the following block of code size times

if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}

we can find a most frequent char in O(RESULT_SIZE) where RESULT_SIZE == 127.

char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(int i = 0; i<size; i++)
    {
        set_char(str[i], result);
    }

    for(int i = 0; i<RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    return frequent_char;
}

Benchmarks: iPhone 5s

size = 1000000
iterations = 500

// seconds = 7.842381
char mostFrequentChar(char str[], int size)

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

In average, the mostFrequentCharOpt1 works in ~24% faster than basic implementation.

Type optimization

The ARM cores registers are 32-bits long. Therefore, changing all local variables that has a type char to type int prevents the processor from doing additional instructions to account for the size of the local variable after each assignment.

Note: The ARM64 provides 31 registers (x0-x30) where each register is 64 bits wide and also has a 32-bit form (w0-w30). Hence, there is no need to do something special to operate on int data type. infocenter.arm.com - ARMv8 Registers

While comparing functions in assembly language version, I've noticed a difference between how the ARM works with int type and char type. The ARM uses LDRB instruction to load byte and STRB instruction to store byte into individual bytes in memory. Thereby, from my point of view, LDRB is a bit slower than LDR, because LDRB do zero-extending every time when accessing a memory and load to register. In other words, we can't just load a byte into the 32-bit registers, we should cast byte to word.

Benchmarks: iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)

Changing char type to int didn't give me a significant increase of speed on iPhone 5s, by way of contrast, running the same code on iPhone 4 gave a different result:

Benchmarks: iPhone 4

size = 1000000
iterations = 500

// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)

// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)

Loop optimization

Next, I did a loop optimization, where, instead of incrementing i value, I decremented it.

before    
for(int i = 0; i<size; i++) { ... }

after
for(int i = size; i--) { ... }

Again, comparing assembly code, gave me a clear distinction between the two approaches.

mostFrequentCharOpt2                                              |      mostFrequentCharOpt3
0x10001250c <+88>:  ldr    w8, [sp, #28] ; w8 = i                 |      0x100012694 <+92>:  ldr    w8, [sp, #28]                                             ; w8 = i
0x100012510 <+92>:  ldr    w9, [sp, #44] ; w9 = size              |      0x100012698 <+96>:  sub    w9, w8, #1 ; w9 = i - 1                                           
0x100012514 <+96>:  cmp    w8, w9 ; if i<size                     |      0x10001269c <+100>: str    w9, [sp, #28] ; save w9 to memmory
0x100012518 <+100>: b.ge   0x100012548 ; if true => end loop      |      0x1000126a0 <+104>: cbz    w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine                    |      0x1000126a4 <+108>: ... set_char start routine
...                                                               |      ...
0x100012534 <+128>: ... set_char end routine                      |      0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr    w8, [sp, #28] ; w8 = i                 |      0x1000126c0 <+136>: b      0x100012694 ; back to the first line
0x10001253c <+136>: add    w8, w8, #1 ; i++                       |      0x1000126c4 <+140>: ...
0x100012540 <+140>: str    w8, [sp, #28] ; save i to $sp+28       |      
0x100012544 <+144>: b      0x10001250c ; back to the first line   |      
0x100012548 <+148>: str    ...                                    |

Here, in place of accessing size from the memory and comparing it with the i variable, where the i variable, was incrementing, we just decremented i by 0x1 and compared the register, where the i is stored, with 0.

Benchmarks: iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization

// seconds = 5.577797
char mostFrequentCharOpt3(char str[], int size) //Loop otimization

Threading optimization

Reading the question accurately gives us at least one more optimization. This line ..optimized to run on dual-core ARM-based processors ... especially, dropped a hint to optimize the code using pthread or gcd.

int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));

    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;

    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;

        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (size - minCharCountPerThread*tnum) >= minCharCountPerThread ? minCharCountPerThread : (size - minCharCountPerThread*(tnum-1));
        tinfo[tnum].str = str;

        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;

    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }

    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;

        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    free(tinfo);

    return frequent_char;
}

Benchmarks: iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

Note: mostFrequentCharThreadOpt works slower than mostFrequentCharOpt2 on iPhone 4.

Benchmarks: iPhone 4

size = 1000000
iterations = 500

// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread otimization

Question

How well optimized is the mostFrequentCharOpt3 and mostFrequentCharThreadOpt, in other words: are there any other methods to optimize both methods?

Source code

How many cores are there on these phones, and how much cache does each core have? Once main memory bandwidth becomes the bottleneck, then using additional threads won't help and make make the algorithm slower. — rcgldr, Sep 22 '15 at 01:56
have you tried using GCD for your threading? (its technically C) it might be better than manually using pthreads — Fonix, Sep 22 '15 at 02:19
The A4 and the A6 are both dual core, why would you expect more than two threads to help in a program that involves no I/O or synchronisation? — Iskar Jarak, Sep 22 '15 at 03:08
I am surprised some of these optimization methods do anything at all, is compiler optimization on? Or must everything be optimized by hand only? — koldewb, Sep 22 '15 at 06:11
@auselen Well, what optimization level I should use if I'd like to see the result of optimization? — tikhop, Sep 22 '15 at 08:54
Optimizations off is where compiler fills your code with clutter well it's not baseless but its more for lets say debugging. Play with this one's switches and see the assembly generated http://goo.gl/N0T048 — auselen, Sep 22 '15 at 10:12
btw I'm thinking of writing some answer later but think of ill suited cases where there are lots of consecutive characters. that's what generally histogram algorithms get hits. — auselen, Sep 22 '15 at 10:14
Typically on -O0 the compiler more or less compiles each statement on its own in a vacuum, so you get a load of needless register-shuffling, pushing things to the stack only to read them straight back, and other generally stupid waste-of-time code between simple instruction templates for each individual operation. Trying to "optimise" that behaviour from the high-level code is largely an exercise in futility, and at worst will give misleading results and leave you with horrible unreadable code that can make things _harder_ for the optimiser to do its job if you _do_ turn it on. — Notlikethat, Sep 22 '15 at 11:12
You should move the pthread_create/pthread_join calls out of the function and let the worker threads remain at all times. You will synchronize with them in a lighter way, for example with semaphores. — , Sep 22 '15 at 15:52
I'm fairly sure the gist of this question is partition the task (with separate frequency counters per partition) using vector processing. As discussed at [http://stackoverflow.com/questions/12985949/methods-to-vectorise-histogram-in-simd][here] it's not exactly easy. — marko, Sep 23 '15 at 22:58

score 1 · Answer 1 · answered Sep 22 '15 at 06:48

Alright, the following things you can try, I can't 100% say what will be effective in your situation, but from experience, if you put all possible optimizations off, and looking at the fact that even loop optimization worked for you: your compiler is pretty numb.

It slightly depends a bit on your THREAD_COUNT, you say its 2 at default, but you might be able to spare some time if you are 100% its 2. You know the platform you work on, don't make anything dynamic without a reason if speed is your priority.

If THREAD == 2, num_threads is a unnecessary variable and can be removed.

int minCharCountPerThread = size/num_threads;

And the olden way to many discussed topic about bit-shifting, try it:

int minCharCountPerThread = size >> 1; //divide by 2

The next thing you can try is unroll your loops: multiple loops are only used 2 times, if size isn't a problem, why not remove the loop aspect? This is really something you should try, look what happens, and if it useful too you. I've seen cases loop unrolling works great, I've seen cases loop unrolling slows down my code.

Last thing: try using unsigned numbers instead if signed/int (unless you really need signed). It is known that some tricks/instruction are only available for unsigned variables.

score 1 · Answer 2 · answered Nov 17 '15 at 20:16

There are quite a few things you could do, but the results will really depend on which specific ARM hardware the code is running on. For example, older iPhone hardware is completely different than the newer 64 bit devices. Totally different hardware arch and diff instruction set. Older 32 bit arm hardware contained some real "tricks" that could make things a lot faster like multiple register read/write operation. One example optimization, instead of loading bytes you load while 32 bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, then another approach can be to break up the memory access so that 1 memory page is processed by 1 thread and then the second thread operates on the 2nd memory page and so on. That way different registers in the different processors can do maximum crunching without reading or writing to the same memory page (and memory access is the slow part typically). I would also suggest that you start with a good timing framework, I built a timing framework for ARM+iOS that you might find useful for that purpose.

Optimizing C code for ARM-based devices

2 Answers2