
So I'm working on a speed contest in Java. I have (number of processors) threads doing work, and they all need to add to a binary tree. Originally I just used a synchronized add method, but I wanted to make it so threads could follow each other through the tree (each thread only has the lock on the object it's accessing). Unfortunately, even for a very large file (48,000 lines), my new binary tree is slower than the old one. I assume this is because I'm getting and releasing a lock every time I move in the tree. Is this the best way to do this or is there a better way?

Each node has a ReentrantLock named lock, and getLock() and releaseLock() just call lock.lock() and lock.unlock();

My code:

public void add(String sortedWord, String word) {

    synchronized(this){
        if (head == null) {
            head = new TreeNode(sortedWord, word);
            return;
        }
        head.getLock();
    }

    TreeNode current = head, previous = null;
    while (current != null) {

        // If this is an anagram of another word in the list..
        if (current.getSortedWord().equals(sortedWord)) {
            current.add(word);
            current.releaseLock();
            return;
        }
        // New word is less than current word
        else if (current.compareTo(sortedWord) > 0) {
            previous = current;
            current = current.getLeft();
            if(current != null){
                current.getLock();
                previous.releaseLock();
            }
        }
        // New word greater than current word
        else {
            previous = current;
            current = current.getRight();
            if(current != null){
                current.getLock();
                previous.releaseLock();
            }
        }
    }

    if (previous.compareTo(sortedWord) > 0) {
        previous.setLeft(sortedWord, word);
    }
    else {
        previous.setRight(sortedWord, word);
    }
    previous.releaseLock();
}

EDIT: Just to clarify, my code is structured like this: the main thread reads input from a file and adds the words to a queue; each worker thread pulls words from the queue and does some work (including sorting them and adding them to the binary tree).

Brendan Long
  • One small suggestion: instead of NumberOfProcessors threads, you might want to subtract one - because the OS will use one at least, defining a thread per processor pretty much guarantees some context-swapping overhead. – CPerkins Dec 03 '09 at 21:12
  • I figured it would be fine because it's not unreasonable to assume that the threads will be waiting on each other a lot. – Brendan Long Dec 03 '09 at 21:35
  • If they're waiting on each other, that's not good, because they're not working. – Mike Dunlavey Dec 03 '09 at 22:35
  • That's the point. They are waiting on each other while adding to this list. Even using fancy locking, they still spend most of their time waiting on each other. – Brendan Long Dec 04 '09 at 00:08
  • There is some very wrong terminology in the question and most answers: a B-Tree is NOT a binary tree, it's a far more complex data structure: http://en.wikipedia.org/wiki/B-tree – Michael Borgwardt Dec 04 '09 at 09:42
  • Hm ... do you have to use a tree? A HashMap wouldn't do? – meriton Dec 04 '09 at 15:35
  • I don't know how to do a hashmap.. I'll look it up.. – Brendan Long Dec 04 '09 at 19:35

8 Answers


Another thing: there is really no place for a binary tree in performance-critical code. The caching behaviour will kill all performance; the tree should have a much larger fan-out (one cache line per node). With a binary tree you access too much non-contiguous memory. Take a look at the material on Judy arrays.

And you probably want to start with a radix of at least one character before starting the tree.

And do the compare on an int key instead of a string first.

And perhaps look at tries.

And get rid of all the threads and synchronization. Just try to make the problem memory-access bound.

[edit] I would do this a bit differently: I would use a thread for each first character of the string and give each its own B-tree (or perhaps a trie). I'd put a non-blocking work queue in each thread and fill the queues based on the first character of the string. You can get even more performance by pre-sorting the add queue and doing a merge sort into the B-tree. In the B-tree, I'd use int keys representing the first four characters, only referring to the full strings in the leaf pages.

In a speed contest, you hope to be memory access bound, and therefore have no use for threads. If not, you're still doing too much processing per string.
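A minimal sketch of the per-first-character partitioning described above (the class and method names here are mine, not from the original code): each worker owns a disjoint set of buckets, so the trees the workers build never overlap and need no locking at all.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: route each word to a worker's queue by its first
// character, so each worker builds its own private tree without locks.
public class Partitioner {
    private static final int WORKERS = 4;

    @SuppressWarnings("unchecked")
    private final ConcurrentLinkedQueue<String>[] queues =
            new ConcurrentLinkedQueue[WORKERS];

    public Partitioner() {
        for (int i = 0; i < WORKERS; i++) {
            queues[i] = new ConcurrentLinkedQueue<>();
        }
    }

    // Words starting with the same character always land in the same queue,
    // so no two workers ever touch the same sub-tree.
    public void submit(String word) {
        int bucket = Character.toLowerCase(word.charAt(0)) % WORKERS;
        queues[bucket].add(word);
    }

    public ConcurrentLinkedQueue<String> queueFor(int worker) {
        return queues[worker];
    }
}
```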

Stephan Eggermont

I would actually start by looking at the use of compare() and equals() and seeing if something can be improved there. You might wrap your String object in another class with a different compare() method, optimized for your use case. For instance, consider checking hashCode() before falling back to equals(): the hash code is cached, so future calls will be that much faster. Consider interning the strings. I don't know if the VM will accept that many strings, but it's worth checking out.
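As a sketch of that wrapper idea (the class name and exact strategy here are mine, not from the original code): cache the hash code once and use it as a cheap pre-filter before the full character-by-character comparison.

```java
// Hypothetical wrapper: an int comparison rejects most non-matches
// before the (more expensive) String.equals is ever called.
public class SortedKey implements Comparable<SortedKey> {
    private final String value;
    private final int hash; // computed once up front

    public SortedKey(String value) {
        this.value = value;
        this.hash = value.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SortedKey)) return false;
        SortedKey other = (SortedKey) o;
        // Cheap int check first; only equal hashes compare characters
        return hash == other.hash && value.equals(other.value);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public int compareTo(SortedKey other) {
        return value.compareTo(other.value);
    }
}
```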

(this was going to be a comment to an answer but got too wordy).

When reading the nodes you need to get a read-lock for each node as you reach it. If you read-lock the whole tree then you gain nothing. Once you reach the node you want to modify, you release the read lock for that node and try to acquire the write lock. Code would be something like:

TreeNode current; // add a ReentrantReadWriteLock to each node

// enter the current node:
current.getLock().readLock().lock();
if (isTheRightPlace(current)) {
    current.getLock().readLock().unlock();
    current.getLock().writeLock().lock(); // NB: getLock() returns the node's ReentrantReadWriteLock
    // do stuff, then release the lock
    current.getLock().writeLock().unlock();
} else {
    current.getLock().readLock().unlock();
}

Erik
  • I tried code similar to that one but it slowed it down significantly. The answer that said to read lock the whole tree and then write lock the whole tree when I need to add actually worked pretty well (it always gave the correct answer, even with 4 threads and a list of 479,000 words). Fixing the compareTo() method sounds like a good idea.. – Brendan Long Dec 04 '09 at 19:38

Locking and unlocking is overhead, and the more you do it, the slower your program will be.

On the other hand, decomposing a task and running portions in parallel will make your program complete more quickly.

Where the "break-even" point lies is highly dependent on the amount of contention for a particular lock in your program and on the system architecture the program runs on. If there is little contention (as there appears to be in this program) and many processors, this might be a good approach. However, as the number of available processors decreases, the overhead will dominate and the concurrent program will be slower. You have to profile your program on the target platform to determine this.

Another option to consider is a non-locking approach using immutable structures. Rather than modifying a list, for example, you could append the old (linked) list to a new node, then with a compareAndSet operation on an AtomicReference, ensure that you won the data race to set the words collection in current tree node. If not, try again. You could use AtomicReferences for the left and right children in your tree nodes too. Whether this is faster or not, again, would have to be tested on your target platform.
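A rough sketch of what the CAS-based insertion could look like (this node class is a hypothetical stand-in, not the asker's TreeNode): the child links are AtomicReferences, and a failed compareAndSet simply means another thread inserted first, so we descend into its node and retry.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lock-free binary search tree node: insertion never blocks,
// it just retries when it loses a compareAndSet race.
class CasTreeNode {
    final String sortedWord;
    final AtomicReference<CasTreeNode> left = new AtomicReference<>();
    final AtomicReference<CasTreeNode> right = new AtomicReference<>();

    CasTreeNode(String sortedWord) {
        this.sortedWord = sortedWord;
    }

    void insert(String word) {
        CasTreeNode current = this;
        while (true) {
            int cmp = word.compareTo(current.sortedWord);
            if (cmp == 0) {
                return; // already present
            }
            AtomicReference<CasTreeNode> childRef =
                    (cmp < 0) ? current.left : current.right;
            CasTreeNode child = childRef.get();
            if (child == null) {
                // Try to claim the empty slot; if another thread won the
                // race, fall through and descend into its new node instead.
                if (childRef.compareAndSet(null, new CasTreeNode(word))) {
                    return;
                }
                child = childRef.get();
            }
            current = child;
        }
    }
}
```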

erickson

You may try using an upgradeable read/write lock (it may be called an upgradeable shared lock or the like; I do not know exactly what Java provides): use a single RWLock for the whole tree. Before traversing the tree, acquire the read (shared) lock, and release it when done (one acquire and one release in the add method, not more).

At the point where you have to modify the tree, acquire the write (exclusive) lock (or "upgrade" from the read to the write lock), insert the node, and then downgrade to the read (shared) lock again.

With this technique the synchronized block for checking and inserting the head node can also be removed!

It should look something like this:

public void add(String sortedWord, String word) {

    lock.read();

    if (head == null) {
        lock.upgrade();
        head = new TreeNode(sortedWord, word);
        lock.downgrade();
        lock.unlock();
        return;
    }

    TreeNode current = head, previous = null;
    while (current != null) {

        if (current.getSortedWord().equals(sortedWord)) {
            lock.upgrade();
            current.add(word);
            lock.downgrade();
            lock.unlock();
            return;
        }

        // ... more tree traversal; do not touch the lock here ...
    }

    if (previous.compareTo(sortedWord) > 0) {
        lock.upgrade();
        previous.setLeft(sortedWord, word);
        lock.downgrade();
    }
    else {
        lock.upgrade();
        previous.setRight(sortedWord, word);
        lock.downgrade();
    }

    lock.unlock();
}

Unfortunately, after some googling I could not find a suitable "upgradeable" RWLock for Java. ReentrantReadWriteLock is not upgradeable; however, instead of upgrading you can unlock the read lock, then lock the write lock, and (very important) re-check the condition that led to these lines (e.g. if (current.getSortedWord().equals(sortedWord)) {...}). This is important because another thread may have changed things between the read unlock and the write lock.
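Sketched with the standard ReentrantReadWriteLock (the class and the guarded field here are illustrative stand-ins for the tree state, not the asker's code), the release/acquire/re-check dance looks like this:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the "release read, acquire write, re-check" pattern, since
// ReentrantReadWriteLock cannot upgrade a read lock atomically.
public class Upgrader {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private String value; // stand-in for the tree state being guarded

    public void setIfAbsent(String newValue) {
        rwl.readLock().lock();
        try {
            if (value != null) {
                return; // nothing to do; checked under the read lock
            }
        } finally {
            rwl.readLock().unlock();
        }
        // Gap here: another thread may have written in the meantime.
        rwl.writeLock().lock();
        try {
            if (value == null) { // re-check the condition before writing
                value = newValue;
            }
        } finally {
            rwl.writeLock().unlock();
        }
    }

    public String get() {
        rwl.readLock().lock();
        try {
            return value;
        } finally {
            rwl.readLock().unlock();
        }
    }
}
```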

For details, check this question and its answers.

In the end, traversal of the tree will run in parallel. Only when a target node is found does the thread acquire an exclusive lock (and other threads will block only for the duration of the insertion).

Frunsi
  • Would it be safe to do `synchronized(OtherLock){ this.readLock().unlock(); this.writeLock().lock(); }` without rechecking my conditions? – Brendan Long Dec 04 '09 at 00:38
  • Thanks this sped it up a little. – Brendan Long Dec 04 '09 at 00:52
  • I switched it to having an Object named writeLock and doing: `synchronized(writeLock){ do_things(); this.readLock().unlock(); return; }` and it seems to work :) – Brendan Long Dec 04 '09 at 01:41
  • This is not correct. Even though the Java interface (class ReadWriteLock or similar) seems to have two independent locks, that's not the case. The read and write locks are not separate things! In your case read locks may still be active in the synchronized section, but the concept is that when acquiring a write lock, NO readers can be active: acquiring the write lock waits until all read locks have been released. – Frunsi Dec 04 '09 at 02:34
  • Looks like the only safe way is (whenever you need write): `rwl.getReadLock().unlock(); /* short-unlocked-time-here: other threads may modify tree now */ rwl.getWriteLock().lock(); if( condition is still true ) { .. do write .. has_written=true; } rwl.writeLock.unlock(); if( has_written ) return; else ...` – Frunsi Dec 04 '09 at 02:41
  • I realize my `synchronized(writeLock){ do_things(); this.readLock().unlock(); return; }` code doesn't work (the output was terrible), but wouldn't `synchronized(OtherLock){ this.readLock().unlock(); this.writeLock().lock(); do_things(); this.writeLock().unlock(); }` work? Assuming I ensure that every thread has to lock otherLock before getting writeLock()? – Brendan Long Dec 04 '09 at 05:03
  • Not sure, but this seems to be dangerous. Maybe you should just simplify the add() (remove all the locks and make the method synchronized). Then profile your code, check where the real bottlenecks are. Insertion into a B-Tree is not well suited for parallelization. – Frunsi Dec 04 '09 at 11:32

Considering one data set per line, 48k lines isn't all that much, and you can only make wild guesses as to how your operating system and the virtual machine are going to handle your file I/O to make it as fast as possible.

Trying to use a producer/consumer paradigm can be problematic here, as you have to balance the overhead of locks against the actual amount of I/O carefully. You might get better performance if you just try to improve the way you do the file I/O (consider something like mmap()).
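For reference, Java's closest equivalent to mmap() is FileChannel.map. A minimal sketch (the class name is mine) that maps the whole file once and decodes it in memory instead of doing line-by-line reads:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: memory-map the input file and decode it in one go.
public class MappedReader {
    public static String readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }
}
```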

pmr
  • That's a good point. I tried it on /usr/share/dict/words (470,000 words) and the fancy locking was still twice as slow as just using a synchronized method. Thanks for the idea. File IO isn't really the problem because reading the entire file only takes ~1 second (I have another class for shuffling files and it's quite fast). – Brendan Long Dec 03 '09 at 22:23

I would say that doing it this way is not the way to go, even without taking the synchronization performance issues into account.

The fact that this implementation is slower than the original fully synchronized version may be a problem, but a bigger problem is that the locking in this implementation is not at all robust.

Imagine, for example, that you pass in null for sortedWord; this will result in a NullPointerException being thrown, which means the current thread ends up holding onto the lock and leaves your data structure in an inconsistent state. On the other hand, if you just synchronize the method, you don't have to worry about such things. Considering the synchronized version is faster as well, it's an easy choice to make.

Gabriel Reid
  • It's a *speed contest,* not production code. That is a big factor in selecting the "right" approach. – erickson Dec 03 '09 at 20:28
  • The thing is, I can guarantee based on my other code that it will never ever be passed null. In the function that calls the b-tree's add() function, it actually checks if(sortedWord == null) right before calling this. It's not meant to be particularly safe, but my program always works. – Brendan Long Dec 03 '09 at 21:36

You seem to have implemented a Binary Search Tree, not a B-Tree.

Anyway, have you considered using a ConcurrentSkipListMap? This is an ordered data structure (introduced in Java 6), which should have good concurrency.
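A minimal sketch of what that could look like for the anagram problem (class and method names are mine; computeIfAbsent requires Java 8, though ConcurrentSkipListMap itself is Java 6):

```java
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: sortedWord -> list of anagrams, with the skip list
// handling both ordering and thread safety internally.
public class AnagramMap {
    private final ConcurrentSkipListMap<String, List<String>> anagrams =
            new ConcurrentSkipListMap<>();

    public void add(String sortedWord, String word) {
        // computeIfAbsent is atomic per key, so concurrent adders never
        // clobber each other's lists; no explicit locks required.
        anagrams
            .computeIfAbsent(sortedWord, k -> new CopyOnWriteArrayList<>())
            .add(word);
    }

    public List<String> get(String sortedWord) {
        return anagrams.get(sortedWord);
    }

    public String firstKey() {
        return anagrams.firstKey(); // keys stay sorted at all times
    }
}
```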

Neil

I've got a dumb question: since you're reading and modifying a file, you're going to be totally limited by how fast the read/write head can move around and the disk can rotate. So what good is it to use threads and processors? The disk can't do two things at once.

Or is this all in RAM?

ADDED: OK, it's not clear to me how much parallelism can help you here (some, maybe), but regardless, I would suggest squeezing every cycle you can out of each thread. This is what I'm talking about: I wonder if innocent-looking code like those calls to "get" and "compare" methods is taking a larger share of the time than you might expect. If it is, you might be able to do each of them once rather than two or three times, that sort of thing.

Mike Dunlavey
  • The main thread is reading in input and adding it to a queue, and then there are (number of processors) threads pulling data from the queue. The B-Tree is only one thing the worker threads have to do, so we're not waiting on IO. – Brendan Long Dec 03 '09 at 22:25
  • OK, thanks for that clarification. Now, what I would do is see what each thread is typically waiting for, by taking stackshots. That way you can see how much time is spent in I/O, how much in synchronization, and if you're lucky, in other stuff that you don't really need to do. – Mike Dunlavey Dec 03 '09 at 22:31
  • Getting rid of all the threads sounds like the right thing. It should be memory access bound – Stephan Eggermont Dec 03 '09 at 22:34
  • I was actually just talking to someone about the threads, and apparently turning them off doesn't make a major difference, but it's still a noticeable improvement, especially with very large sets of data. – Brendan Long Dec 04 '09 at 00:11