From your code sample, your method of time measurement borders on micro-benchmarking, for which measuring a single execution is misleading.
You can see it discussed at length in the following StackOverflow post: How do I write a correct micro-benchmark in Java?
I've written a JMH benchmark to get a more precise measurement of your sample code. It ran on a quad-core i7 with hyper-threading (but it's a laptop: due to power and heat management, the results may be slightly biased against the multithreaded variants, which produce more heat).
@Benchmark
public void testSequentialFor(Blackhole bh, Words words) {
    List<String> filtered = new ArrayList<>();
    for (String word : words.toSort) {
        if (word.startsWith(words.prefix)) {
            filtered.add(word);
        }
    }
    bh.consume(filtered);
}
@Benchmark
public void testParallelStream(Blackhole bh, Words words) {
    bh.consume(words.toSort.parallelStream()
            .filter(w -> w.startsWith(words.prefix))
            .collect(Collectors.toList())
    );
}
@Benchmark
public void testManualThreading(Blackhole bh, Words words) {
    // This is quick and dirty, but gives a decent baseline as to
    // what a manually threaded partitioning can achieve.
    List<Future<List<String>>> async = new ArrayList<>();
    // this has to be tuned to avoid creating "almost empty" work units
    int batchSize = words.size / ForkJoinPool.commonPool().getParallelism();
    int numberOfDispatchedWords = 0;
    while (numberOfDispatchedWords < words.toSort.size()) {
        int start = numberOfDispatchedWords;
        int end = numberOfDispatchedWords + batchSize;
        async.add(words.threadPool.submit(() -> {
            List<String> batch = words.toSort.subList(start, Math.min(end, words.toSort.size()));
            List<String> result = new ArrayList<>();
            for (String word : batch) {
                if (word.startsWith(words.prefix)) {
                    result.add(word);
                }
            }
            return result;
        }));
        numberOfDispatchedWords += batchSize;
    }
    List<String> result = new ArrayList<>();
    for (Future<List<String>> asyncResult : async) {
        try {
            result.addAll(asyncResult.get());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    bh.consume(result);
}
@State(Scope.Benchmark)
public static class Words {

    ExecutorService threadPool = ForkJoinPool.commonPool();

    List<String> toSort;

    @Param({"100", "1000", "10000", "100000"})
    private int size;

    private String prefix = "AA";

    @Setup
    public void prepare() {
        // a 4 to 13 letters long, random word
        // for more precision, it should not be that random (use a fixed seed),
        // but given the simple nature of the filtering, I guess it's ok this way
        Supplier<String> wordCreator = () -> RandomStringUtils.random(4 + ThreadLocalRandom.current().nextInt(10));
        toSort = Stream.generate(wordCreator).limit(size).collect(Collectors.toList());
    }
}
Here are the results:
Benchmark                     (size)   Mode  Cnt        Score       Error  Units
PerfTest.testManualThreading     100  thrpt    6    95971,811 ±  1100,589  ops/s
PerfTest.testManualThreading    1000  thrpt    6    76293,983 ±  1632,959  ops/s
PerfTest.testManualThreading   10000  thrpt    6    34630,814 ±  2660,058  ops/s
PerfTest.testManualThreading  100000  thrpt    6     5956,552 ±   529,368  ops/s
PerfTest.testParallelStream      100  thrpt    6    69965,462 ±   451,418  ops/s
PerfTest.testParallelStream     1000  thrpt    6    59968,271 ±   774,859  ops/s
PerfTest.testParallelStream    10000  thrpt    6    29079,957 ±   513,244  ops/s
PerfTest.testParallelStream   100000  thrpt    6     4217,146 ±   172,781  ops/s
PerfTest.testSequentialFor       100  thrpt    6  3553930,640 ± 21142,150  ops/s
PerfTest.testSequentialFor      1000  thrpt    6   356217,937 ±  7446,137  ops/s
PerfTest.testSequentialFor     10000  thrpt    6    28894,748 ±   674,929  ops/s
PerfTest.testSequentialFor    100000  thrpt    6     1725,735 ±    13,273  ops/s
So the sequential version is far faster up to a few thousand elements; manual threading catches up a little before 10k elements, parallel streams a little after, and beyond that the threaded variants perform better.
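One way to act on these numbers is to dispatch on input size. A minimal sketch; the 10_000 cutoff is an assumption read off the crossover point above, so measure on your own hardware before hard-coding anything like it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class PrefixFilter {

    // Assumed cutoff: below it the plain loop wins, above it parallelStream() does.
    private static final int PARALLEL_CUTOFF = 10_000;

    static List<String> filterByPrefix(List<String> words, String prefix) {
        if (words.size() < PARALLEL_CUTOFF) {
            List<String> filtered = new ArrayList<>();
            for (String word : words) {
                if (word.startsWith(prefix)) {
                    filtered.add(word);
                }
            }
            return filtered;
        }
        return words.parallelStream()
                .filter(w -> w.startsWith(prefix))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filterByPrefix(List.of("AAa", "ABb", "AAc"), "AA")); // [AAa, AAc]
    }
}
```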
Considering the amount of code needed to write the manual-threading variant, and the risk of introducing a bug or an inefficiency by badly calculating the batch size, I'd probably not elect that option even though it looks like it can be faster than streams for huge lists.
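For reference, the batch-size pitfall is easy to defuse with ceiling division; this helper is mine, not part of the benchmark above:

```java
public class Batching {

    // Ceiling division spreads the work over at most `parallelism` batches and
    // never returns 0, which would make a dispatch loop like the one above spin
    // forever when size < parallelism.
    static int batchSize(int size, int parallelism) {
        return Math.max(1, (size + parallelism - 1) / parallelism);
    }

    public static void main(String[] args) {
        System.out.println(batchSize(100_000, 8)); // 12500
        System.out.println(batchSize(3, 8));       // 1
    }
}
```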
I would not try to sort first and then binary-search, as filtering is an O(N) operation while sorting is O(N log N) (on top of which you still have to add the binary search). So unless you have a very precise pattern in the data, I do not see it working to your advantage.
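For completeness, the sort-then-search approach would look roughly like the sketch below; it can only pay off when the list is sorted once and queried many times, since each lookup is then O(log N) plus the size of the matching range. The helper names are mine, and it assumes a non-empty prefix whose last character is not Character.MAX_VALUE:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PrefixRange {

    // Returns the view of a lexicographically sorted list containing exactly
    // the words that start with prefix, found via two binary searches.
    static List<String> withPrefix(List<String> sorted, String prefix) {
        int from = lowerBound(sorted, prefix);
        // Smallest string greater than every word carrying the prefix:
        // the prefix with its last character incremented.
        String upper = prefix.substring(0, prefix.length() - 1)
                + (char) (prefix.charAt(prefix.length() - 1) + 1);
        int to = lowerBound(sorted, upper);
        return sorted.subList(from, to);
    }

    // Index of the first element >= key (classic lower-bound binary search).
    static int lowerBound(List<String> sorted, String key) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("AAb", "AAa", "AB", "B", "AA"));
        Collections.sort(words); // pay O(N log N) once...
        System.out.println(withPrefix(words, "AA")); // ...then each query prints [AA, AAa, AAb]
    }
}
```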
Be careful, though, not to draw conclusions this benchmark cannot support. For one thing, it rests on the assumption that the filtering is the only thing happening in the program and competing for CPU time. If you are in any kind of "multi-user" application (e.g. a web application), this is probably not true, and you may very well lose everything you thought you would gain from multithreading.