
For a university project we are implementing an algorithm that brute-forces an AES key which we assume is partially known. We have implemented several versions, including one that exploits multithreading in C++. The implementation allocates a variable number of threads, passed as input at launch, and divides the key space equally among them; each thread cycles through its own range, attempting each key. In practice the implementation works, since it finds the key for any combination of #bitsToHack/#threads, but it returns strange timing results.

    //Structs for threads and respective data
    pthread_t threads[num_of_threads];
    struct bf_data td [num_of_threads];
    int rc;
    //Space division
    uintmax_t index = pow (BASE_NUMBER, num_bits_to_hack);
    uintmax_t step = index/num_of_threads;

    if(sem_init(&s, 1, 0)!=0){
        printf("Error during semaphore initialization\n");
        return -1;
    }

    for(int i = 0; i < num_of_threads; i++){
        //Structure initialization
        td[i].ciphertext = ciphertext;
        td[i].hacked_key = hacked_key;
        td[i].iv_aes = iv_aes;
        td[i].key = key_aes;
        td[i].num_bits_to_hack = num_bits_to_hack;
        td[i].plaintext = plaintext;
        td[i].starting_point = step*i;
        td[i].step = step;
        td[i].num_of_threads = num_of_threads;
        if(DEBUG)
            printf("Starting point for thread %d is: %lu, using step: %lu\n", i , td[i].starting_point, td[i].step);

        rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);

        if (rc){
            cout << "Error:unable to create thread," << rc << endl;
            exit(-1);
        }
    }
    sem_wait(&s);
    for(int i = 0; i < num_of_threads; i++){
        pthread_join(threads[i], NULL);
    }

For the decryption_brute_force function (the body executed by each thread):

    void* decryption_brute_force(void* data){
        // Copy data to thread-local memory
        // Build the key to begin the search from the starting point
        // for each key from starting_point to starting_point + step:
        //     Try decryption
        //     if the obtained plaintext corresponds to the expected one:
        //         Print results, wake up the main thread and terminate
        //     else:
        //         increment the key and continue
    }
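
A minimal C++ sketch of that loop (the bf_data fields mirror the snippet above but are simplified; try_decrypt is a placeholder for the actual AES-CBC decryption plus plaintext comparison, and the key construction is omitted):

    #include <cstdint>
    #include <cstdio>
    #include <semaphore.h>

    extern sem_t s;                       // the semaphore the main thread waits on
    struct bf_data {                      // simplified: only the fields used below
        uintmax_t starting_point;
        uintmax_t step;
        // ciphertext, plaintext, iv_aes, key, num_bits_to_hack, ... as in the question
    };
    // Placeholder: performs one AES-CBC decryption with the candidate key and
    // returns true when the recovered plaintext equals the expected plaintext.
    bool try_decrypt(const bf_data& d, uintmax_t candidate);

    void* decryption_brute_force(void* data){
        const bf_data d = *static_cast<const bf_data*>(data);    // copy to thread-local memory
        for (uintmax_t i = 0; i < d.step; ++i) {
            const uintmax_t candidate = d.starting_point + i;     // next key suffix to try
            if (try_decrypt(d, candidate)) {
                std::printf("Key found at offset %ju\n", candidate);
                sem_post(&s);             // wake the main thread blocked in sem_wait
                break;
            }
        }
        return nullptr;
    }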

To conclude the project we intended to study the optimal number of threads, expecting performance to improve as the number of threads increased up to a threshold, after which the system would no longer benefit from additional threads. At the end of the analysis (a simulation lasting about 9 hours) we obtained the results shown in the attached plot (not reproduced here). We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it schedule 32 and 8 threads better than 16?

NperNedo
  • Many reasons are possible: https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency, locking of data structures becomes the limiting factor, cache misses become predominant. Have you run your code in a profiler yet? That can help you see where your code starts stalling, e.g. waiting for locks or retrieving data. So yes, it could definitely be your processor architecture; what kind of hardware do you run on? If you have more threads than cores you will start losing out for sure. – Pepijn Kramer Jan 29 '23 at 09:58
  • @PepijnKramer The function you suggested returns 16 as the optimal number of threads... the configuration that returns the worst performance. I run it on an 11th Gen Intel(R) Core(TM) i7-11800H. It could actually be cache misses, but then I can't explain why the results improve again with 32 threads; that should increase cache misses. Thank you for your time! – NperNedo Jan 29 '23 at 10:06
  • Do you have an estimate of the uncertainty of your measurements, i.e. statistical and systematic error bars? In other words: are you sure that the finding is significant? – Manfred Jan 29 '23 at 10:10
  • How do you split/share the search space? Can you at least write a pseudo-algorithm here? – huseyin tugrul buyukisik Jan 29 '23 at 10:15
  • @Manfred To extract meaningful data we run 10 repetitions for each configuration and estimate a 99% confidence interval (with a Student's t) for each result. I did not put them in the plot since they are very small, so they don't overlap. – NperNedo Jan 29 '23 at 10:16
  • it's not possible to explain the result of your analysis (or find its flaw) when all you present is a chart. Much more detail is needed. Posting a [mcve] of the benchmark would be a start – 463035818_is_not_an_ai Jan 29 '23 at 10:17
  • @huseyintugrulbuyukisik I've updated the question with the section where I allocate and start the threads – NperNedo Jan 29 '23 at 10:21
  • @NperNedo does the amount of work per element depend on the element value? For example, does it loop ```some_data_value``` times per element? – huseyin tugrul buyukisik Jan 29 '23 at 10:23
  • @huseyintugrulbuyukisik Yeah, actually the decryption_brute_force function cycles over a value that depends on the number of keys that need to be tested and the number of threads that the system allocates (in order to try them all) – NperNedo Jan 29 '23 at 10:25
  • Side note, it will not improve performance but please use std::thread (or std::async) in C++, not pthreads. And then use std::mutex and std::scoped_lock/std::unique_lock; these are RAII objects and will always unlock a mutex, even in the case of errors. – Pepijn Kramer Jan 29 '23 at 10:25
  • @PepijnKramer I'll do it now, thank you for the suggestion – NperNedo Jan 29 '23 at 10:25
  • The actual code to start your threads is not all that interesting in itself; the algorithm itself and how it accesses data is more important. – Pepijn Kramer Jan 29 '23 at 10:26
  • Also std::vector is more flexible than using a "C" style array with a fixed number of threads. pthread_t threads[num_of_threads]; is not standard C++ if num_of_threads is not a constant. – Pepijn Kramer Jan 29 '23 at 10:27
  • @NperNedo ```decryption_brute_force``` can we have some code about this please? Maybe even pseudo-code? – huseyin tugrul buyukisik Jan 29 '23 at 10:27
  • what you posted is a fragment of the code; most declarations are missing, the definition of the function doing the work is missing, the way you measure the time is missing, and last but not least the flags you used to compile are missing – 463035818_is_not_an_ai Jan 29 '23 at 10:28
  • Don't forget to join with your threads (or synchronize with your futures). I can't see that in the code you posted. – Pepijn Kramer Jan 29 '23 at 10:29
  • Well, yes, the optimal number of threads will be affected by CPU architecture (and cache/memory architecture). As a first cut (assuming each thread does calculations fairly intensely, e.g. a loop running continuously, without significant waiting [say, I/O] or synchronisation [with other threads]), the optimal number of threads will be proportional to the number of available cores (number of processors times cores per processor). In more advanced processors, support of features such as hyperthreading has an effect. Considerations like cache coherency between cores may also be relevant. – Peter Jan 29 '23 at 10:29
  • @NperNedo ```Try decryption```: is this fixed steps or variable steps in itself? – huseyin tugrul buyukisik Jan 29 '23 at 10:36
  • @huseyintugrulbuyukisik Fixed steps, classical AES-CBC decryption steps – NperNedo Jan 29 '23 at 10:38
  • @PepijnKramer I forgot to insert that piece of code here in the question, but I actually did it, sorry – NperNedo Jan 29 '23 at 10:39
  • @NperNedo last question, are you using only a single encrypted input (so that the key to find is always the same) to benchmark all settings? – huseyin tugrul buyukisik Jan 29 '23 at 10:41
  • @Peter I'm starting to think that cache coherency is the point – NperNedo Jan 29 '23 at 10:41
  • @huseyintugrulbuyukisik Yes, don't worry ask whatever you want :) – NperNedo Jan 29 '23 at 10:42
  • `void* decryption_brute_force(void* data)` C++ should not use void* (except in rare cases where type erasure is needed). If all you are going to do is read from your data, then do not copy it to thread-local memory – Pepijn Kramer Jan 29 '23 at 10:42
  • @PepijnKramer You're right, I'll fix it – NperNedo Jan 29 '23 at 10:44
  • @NperNedo `uintmax_t index = pow (BASE_NUMBER, num_bits_to_hack);` -- Do not use floating point functions such as `pow` to solve integer-based problems. [See the reason why](https://stackoverflow.com/questions/25678481/why-does-pown-2-return-24-when-n-5-with-my-compiler-and-os). – PaulMcKenzie Jan 29 '23 at 11:06
  • Here is an outline of what I would write to do all the thread handling https://onlinegdb.com/9LOnrk5sj (just to show what C++ can offer); a rough sketch along those lines is shown below. Again not directly related to your question. – Pepijn Kramer Jan 29 '23 at 11:10
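
A minimal sketch along the lines of those comments (illustrative only, not the code behind the link above): std::thread and std::vector instead of pthreads and a variable-length array, and an integer shift instead of pow for the key-space size. The worker is assumed to take a bf_data& here.

    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    struct bf_data {                          // simplified: only the fields set below
        uintmax_t starting_point;
        uintmax_t step;
    };
    void decryption_brute_force(bf_data& d);  // worker, adapted to take a reference

    void launch(unsigned num_of_threads, unsigned num_bits_to_hack){
        // std::thread::hardware_concurrency() gives a hint for a sensible default thread count.
        const uintmax_t index = uintmax_t{1} << num_bits_to_hack;   // 2^bits without floating point
        const uintmax_t step  = index / num_of_threads;

        std::vector<bf_data> td(num_of_threads);
        std::vector<std::thread> threads;
        threads.reserve(num_of_threads);

        for (unsigned i = 0; i < num_of_threads; ++i){
            td[i].starting_point = step * i;
            td[i].step = step;
            threads.emplace_back(decryption_brute_force, std::ref(td[i]));
        }
        for (auto& t : threads)
            t.join();                         // wait for every worker to finish
    }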

1 Answer


From the comments, I think the linear-search pattern in each thread yields different results for different numbers of threads. When you double the threads, the point at which a thread finds the key may shift to a later position within that thread's loop. But once you double again, it cannot shift much further because there are too many threads. You said you always benchmark with the same encrypted data. Did you try different inputs?

             step is an integer, so the division may not distribute the work exactly evenly
8 threads & step=7 (56 work total)
                   index-16 (0-based)
                   v
 01234567 89abcdef 01234567 89abcdef
|        |        |.       |        ...
500 seconds, as it's found in the first loop iteration of its thread

16 threads & step=3 (56 work total)
                      index-16 again, but at the second iteration now
                      v
 012 345 678 9ab cde f01 234 567 8
|   |   |   |   |   | . |   |   |   ...
1000 seconds, as it is found only at the second iteration within its thread

Another example with 2 threads and 3 threads:

x to be found at the 51st element of 100 elements of work:
2 threads
|                     |x(1st iteration)       |
3 threads
|             |........x     |                |
about 18x slower than 2 threads (1 iteration vs 18 iterations)
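
The effect can be checked with some quick arithmetic: with a contiguous split like the one in the question, the thread that owns the target key reaches it after roughly target % step iterations, so the wall time follows that remainder rather than the total key-space size. A small sketch reproducing the 2-vs-3-thread example above (the numbers are only those of the illustration):

    #include <cstdint>
    #include <cstdio>

    int main(){
        // 100-element search space, target at 0-based index 50 (the "51st" element).
        const uintmax_t work   = 100;
        const uintmax_t target = 50;
        const unsigned counts[] = {2, 3};
        for (unsigned threads : counts){
            const uintmax_t step  = work / threads;    // integer division, as in the question
            const uintmax_t owner = target / step;     // thread whose range contains the target
            const uintmax_t iters = target % step + 1; // iterations that thread needs to reach it
            std::printf("%u threads: thread %ju finds it after %ju iteration(s)\n",
                        threads, owner, iters);
        }
        // Prints 1 iteration for 2 threads and 18 iterations for 3 threads,
        // i.e. roughly 18x more work for the finding thread despite having more threads.
    }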
huseyin tugrul buyukisik
  • I had thought about this problem; after several tests (including numerical ones) we concluded that the starting points of the 8 threads are also covered by the 16-thread execution. For example, if I have 256 keys to test with 8 threads, the starting_point of each thread will be 0, **32**, **64**, ..., **224**. In the case with 16 threads the starting_points will instead be 0, 16, **32**, 48, **64**, ..., **224**, 240. So as you can see the starting points are the same, and what you state (however true in some cases) is impossible in this one. – NperNedo Jan 29 '23 at 10:53
  • I updated the illustration with a table now. With an integer step value it may have this issue? – huseyin tugrul buyukisik Jan 29 '23 at 10:56
  • A simpler example would be 2 threads and 3 threads. If the work space is 100 elements, divided into 50 and 33, and the element to find is at 51, then going from 2 threads to 3 threads would make it 18x slower. – huseyin tugrul buyukisik Jan 29 '23 at 11:06
  • @PepijnKramer yes, your way is better for sharing work. I'd also add some atomic-integer based work stealing (a rough sketch is below) just to make sure any number of threads can work (at least until context switching shows lag). – huseyin tugrul buyukisik Jan 29 '23 at 11:14
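
A rough sketch of that atomic-counter idea (illustrative only; try_key stands in for one AES decryption plus plaintext check): each thread repeatedly claims the next chunk of candidate keys from a shared counter, so the distribution of work no longer depends on how the key space divides by the thread count.

    #include <atomic>
    #include <cstdint>

    std::atomic<uintmax_t> next_candidate{0};   // shared cursor over the key space
    std::atomic<bool> found{false};             // set once any thread finds the key

    // Placeholder: one AES decryption with the candidate key plus plaintext comparison.
    bool try_key(uintmax_t candidate);

    void worker(uintmax_t key_space, uintmax_t chunk){
        while (!found.load(std::memory_order_relaxed)){
            const uintmax_t begin = next_candidate.fetch_add(chunk);   // claim the next chunk
            if (begin >= key_space)
                break;                                                 // key space exhausted
            const uintmax_t end = begin + chunk < key_space ? begin + chunk : key_space;
            for (uintmax_t k = begin; k < end; ++k){
                if (try_key(k)){
                    found.store(true);                                 // tell the other threads to stop
                    return;
                }
            }
        }
    }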