How can I use Rayon to split a big range into chunks of ranges and have each thread find within a chunk?

Question

I am making a program that brute forces a password by parallelization. At the moment the password to crack is already available as plain text, I'm just attempting to brute force it anyway.

I have a function called generate_char_array() which, based on an integer seed, converts base and returns a u8 slice of characters to try and check. This goes through the alphabet first for 1 character strings, then 2, etc.

let found_string_index = (0..1e12 as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let bytes = generate_char_array(*i, &mut array);
    return &password_bytes == &bytes;
});

With the found string index (or seed integer rather), I can generate the found string.

The problem is that the way Rayon parallelizes this for me is split the arbitrary large integer range into thread_count-large slices (e.g. for 4 threads, 0..2.5e11, 2.5e11..5e11 etc). This is not good, because the end of the range is for arbitrarily super large password lengths (10+, I don't know), whereas most passwords (including the fixed "zzzzz" I tend to try) are much shorter, and thus what I get is that the first thread does all the work, and the rest of the threads just waste time testing way too long passwords and synchronizing; getting actually slower than single thread performance as a result.

How could I instead split the arbitrary big range (doesn't have to have an end actually) into chunks of ranges and have each thread find within chunks? That would make the workers in different threads actually useful.

There a few problems with your code. 1) Your array is 20 bytes, but `u64` holds 8 bytes. 2) Your iteration is from 0 - 1e12. To get all the `u64` values exhaustively your range should be `0 .. 0xffffffffffffffff`. — Peter Hall, May 10 '18 at 15:14
A way of doing this might be to reformulate it into two loops, The outer (parallel) loop is the first byte, and then the inner loop does exactly the same as you have now, except using the first digit from the outer loop. That way there'll be a thread per first byte, and they'll all start short and work up to longer ones. — Peter Hall, May 10 '18 at 15:18
@PeterHall good observation, but this was just a quick thing I wrote; ideally I should be using BigInts or whatever the Rust equivalent is. By the way also note that in my u64 each byte can store 256 different combinations of data, but in the array each byte can only store 26 different combinations of data, because at the moment I'm only working with the lower-case alphabetic characters. — Attila Szeremi, May 10 '18 at 16:32
Also in your example, if I have each outer loop work with the "first byte", I don't get how this would work, because I'd have at most 16 threads on a 16 thread Ryzen machine, whereas the first byte can have 256 different combinations. — Attila Szeremi, May 10 '18 at 16:33
Actually I could probably at runtime ensure that the number of threads would be a power of 2, and then take that number and make sure that each thread started on the bit level with its own number, and then iterate each thread ensuring the same starting bits. Not a bad idea! — Attila Szeremi, May 10 '18 at 16:42

Peter Hall · Answer 1 · 2018-05-11T00:30:58.213

1

This is a version of what I suggested in my comment.

The main loop is parallel and is only over the first byte of each attempt. For each first byte, do the full brute force search for the remainder.

let matched_bytes = (0 .. 0xFFu8).into_par_iter().filter_map(|n| {
    let mut array = [0u8; 8];
    // the first digit is always the same in this run
    array[0] = n;
    // The highest byte is 0 because it's provided from the outer loop
    (0 ..= 0x0FFFFFFFFFFFFFFF as u64).into_iter().filter_map(|i| {
        // pass a slice so that the first byte is not affected
        generate_char_array(i, &mut array[1 .. 8]);
        if &password_bytes[..] == &array[0 .. password_bytes.len()] {
            Some(array.clone())
        } else {
            None
        }
    }).next()
}).find_any(|_| true);

println!("found = {:?}", matched_bytes);

Also, even for a brute force method, this is probably highly inefficient still!

edited May 11 '18 at 00:30

answered May 10 '18 at 15:33

Peter Hall

53,120
14
139
204

The problem I see here is that the outer `filter_map()` callback is expected to run multiple times per thread to cover every possible combination of the first byte... however the inner loop will run a super long time per callback, resulting in it taking forever to cover each combination of first byte. – Attila Szeremi May 10 '18 at 17:33
So say we weren't dealing with byte arrays, but rather character arrays. Your code would nicely split up the brute force work between two cores to try all combinations starting with a-m for the first core, and n-z for the second. So the first core will try `a`, then `aa` all the way to `az`, then `aaa` to `azz` etc all the way to `azzzzzzz` before it would guess the imaginary correct password: `ba` – Attila Szeremi May 10 '18 at 17:37
Yes. That's true. Shepmaster's approach is better. – Peter Hall May 11 '18 at 00:30

Shepmaster · Accepted Answer · 2018-05-10T17:03:55.613

1

This goes through the alphabet first for 1 character strings, then 2

You wish to impose some sequencing on your data processing, but the whole point of Rayon is to go in parallel.

Instead, use regular iterators to sequentially go up in length and then use parallel iterators inside a specific length to quickly process all of the values of that length.

Since you haven't provided enough code for a runnable example, I've made this rough approximation to show the general shape of such a solution:

extern crate rayon;

use rayon::iter::{IntoParallelRefIterator, ParallelIterator};
use std::ops::RangeInclusive;

type Seed = u8;

const LENGTHS: RangeInclusive<usize> = 1..=3;
const SEEDS: RangeInclusive<Seed> = 0..=std::u8::MAX;

fn find<F>(test_password: F) -> Option<(usize, Seed)>
where
    F: Fn(usize, Seed) -> bool + Sync,
{
    // Rayon doesn't support RangeInclusive yet
    let seeds: Vec<_> = SEEDS.collect();

    // Step 1-by-1 through the lengths, sequentially
    LENGTHS.flat_map(|length| {
        // In parallel, investigate every value in this length
        // This doesn't do that, but it shows how the parallelization
        // would be introduced
        seeds
            .par_iter()
            .find_any(|&&seed| test_password(length, seed))
            .map(|&seed| (length, seed))
    }).next()
}

fn main() {
    let pass = find(|l, s| {
        println!("{}, {}", l, s);
        // Actually generate and check the password based on the search criteria
        l == 3 && s == 250
    });

    println!("Found password length and seed: {:?}", pass);
}

This can "waste" a little time at the end of each length as the parallel threads spin down one-by-one before spinning back up for the next length, but that seems unlikely to be a primary concern.

edited May 10 '18 at 17:03

answered May 10 '18 at 16:08

Shepmaster

388,571
95
1,107
1,366

It doesn't look like this code would work. Remember, there are always 26^length possible password combinations, so different possible amount of combinations depending on the length. True though that I could determine the ranges needed to be checked for each length, and just iterate those; as long as I can incrementally tests passwords in parallel in order of length, the performance can be good... – Attila Szeremi May 10 '18 at 16:49
By the way I don't intend any strict sequencing of data processing, but what I was trying to do was to check in batches; so one thread would check, say, ints from 0..100000, another would grab 100000..20000, etc. and each thread is to be working on its batch, and when it's done, would grab a new batch. – Attila Szeremi May 10 '18 at 16:50
@AttilaSzeremi it isn't supposed to "work" — how could it considering you haven't provided 95% of your own code? What it does do is show the shape of the solution you need to implement. I'll clarify. – Shepmaster May 10 '18 at 16:51
Hang on, let me provide the code that worked with the help of the information you have provided me... – Attila Szeremi May 10 '18 at 17:10
@AttilaSzeremi you are encouraged to post your own answer with that. – Shepmaster May 10 '18 at 17:11
So here it is: https://github.com/amcsi/rust-experimenting-password-cracker/commit/8f16495ecac4da0c13e6e45360cfc37a8439947c I got like a 6x speed improvement with parallelization on my 8 core CPU. That's a big success! – Attila Szeremi May 10 '18 at 17:16
I actually also tried linear range iterations rather than incremental ones, and turns out that it doesn't scale. Probably has to do with the spinning down thing you were talking about. – Attila Szeremi May 10 '18 at 17:52

score 0 · Answer 3 · edited May 11 '18 at 16:28

If Rayon splits the slices as you described, then apply simple math to balance the password lengths:

let found_string_index = (0..max_val as u64).into_par_iter().find_any(|i| {
    let mut array = [0u8; 20];
    let v = i/span + (i%span) * num_cpu;

    let bytes = generate_char_array(*v, &mut array);
    return &password_bytes == &bytes;
});

The span value depends on the number of CPUs (the number of threads used by Rayon), in your case:

let num_cpu = 4;
let span = 2.5e11 as u64;
let max_val = span * num_cpu;

Note the performance of this approach is highly dependent on how Rayon performs the split of the sequence on parallel threads. Verify that it works as you reported in the question.

How can I use Rayon to split a big range into chunks of ranges and have each thread find within a chunk?

3 Answers3