Thread safety in huge data model

Question

Background: I have got a (more or less) huge data model in the memory. The model contains around 3.150.000 to 12.600.000 objects that could be modified directly. In addition, there are around 75.000.000 objects that can only be modified via those 3.150.000 to 12.600.000 objects.

On the other hand, there are around 10 modules which accessing the model. These modules can be grouped into:

reading and modifying some of the objects every 250 ms to 1000 ms
reading and modifying some of the objects on demand
reading the some of the objects if they have been changed

Question: How to synchronize such a data model? There are the following ideas in my mind:

(1) A lock in every class that can be directly modified. Advantage: Only the objects that are modified must be locked. Disadvantage: A high synchronization effort and a huge amount of lock instances (3.150.000 to 12.600.000 additional objects/locks). There is a great danger of doing something wrong in the synchronization (deadlocks, etc.).

(2) A central interface to access the whole data model. This interface would lock the whole model on every modification via a single lock. Advantage: Only a single lock --> less synchronization effort. Disadvantage: The whole model is locked regardless of the type of change.

(3) Dispatch Thread (like in AWT/Swing). A thread which processes tasks (events). Advantage / disadvantage like idea (2). However, this would be a event based solutuion. I read Graham Hamilton's article about multi-threading in GUI-tollkits. In addition, there is a great talk about "Events versus Threads" by John Ousterhout. Of course my data model isn't that extensive, but the article gets to the heart of the matter.

Here the link to Graham Hamilton's article: https://weblogs.java.net/blog/kgh/archive/2004/10/multithreaded_t.html

So, what are your experiences? Maybe you have a better idea.

EDIT: I made a big mistake on the object calculation. I just updated the amount.

Thanks in advance :)

EDIT 2: Here a model I just created for demonstration purposes:

enum Ware { WOOD, COAL, STONE }
class Stock { Map<Ware, Integer> internalStock; }
class Coordinate { int x; int y; }
interface ILand {}

class World {
  Map<Coordinate, ILand> land;
  Map<Coordinate, Ship> ships;
}

class Island implements ILand { Stock resources; }
class Ship { Stock stock; }
class Building {Stock stock; }

class Colony implements ILand {
  Island builtOn;
  Set<Building> building;
}

class Character {
  Set<Colony> colonies;
  Set<Ship> fleet;
}

This would be the strucure of the data model:

Model
   World     <>--- ILand
             <>--- Ship
   Character <>--- Colony <>--- Building <>--- Stock
                          <>--- Island   <>---Stock
             <>--- Ship   <>--- Stock

How many threads? How many objects touched every 250-1000ms? — brettw, Nov 16 '13 at 09:19
Regarding (1) In my experience, the danger of deadlocks is mostly overvalued. From what I read above, I would not expect too much problems. However, of course, it depends on your exact scenario, especially whether or not multiple objects need to be locked simultaneously to achieve a certain task. Hard to give specific advice without knowing more details. For (3) I would question if it makes sense to decouple UI and model, e.g. by means of a work item queue or the like. Accessing the model could be still multithread, but the UI doesn't need to be so you avoid problems in that area. — JensG, Nov 16 '13 at 09:55
In the worst case all objects are modified directly. In average I expect around 400.000 objects to be directly modified every 250-1000ms (would needs to be locked using idea (1)). I expect 1 thread per module --> 10 threads. — QStormDS, Nov 16 '13 at 10:04
Has anyone experiances with such a huge amount of locks? How is the performance? — QStormDS, Nov 16 '13 at 10:07
You'd probably want _per-object_ locks, not per-class locks (unless you're doing something with the class-object). Note that you're probably going to want the ability to dispatch events for 'on modification' modules. — Clockwork-Muse, Nov 16 '13 at 10:16
Follow-up question, of the objects being updated in a given time period (let's say 1 second), how many are likely to be the *same* objects? Are mostly different objects updated, or is there a lot of contention? You may be able to get away with larger-grain locks (class or object) on uncontended objects, and finer grain locks on contended objects. — brettw, Nov 16 '13 at 14:45

score 2 · Answer 1 · edited Nov 16 '13 at 16:41

2

You may want to consider making your data model into an immutable persistent data structure.

This approach is used to very good effect in languages like Scala and Clojure. The following video is well worth watching if you want to understand this technique better:

http://www.infoq.com/presentations/Value-Identity-State-Rich-Hickey

This is often a good strategy when you have significant concurrent access: it has various advantages:

Readers don't require any locking. This can be a big performance win in situations where there are many readers.
Update can happen atomically - this means that you never run the risk of readers seeing the data in an inconsistent state.
You can take a "snapshot" of the whole data structure at any time. Since the immutable data structure can't change, you are free to get a reference to it and then examine it at leisure
Updates are still relatively cheap: structural sharing means that you can make a new version of the data model with a few changes without copying the whole data model.
You can have various different update semantics that suit your requirements. In this case, it sounds like you have a "read and update" semantic coupled with some form of change notification.

edited Nov 16 '13 at 16:41

toto2

5,306
21
24

answered Nov 16 '13 at 11:13

mikera

105,238
25
256
415

Thanks a lot for your answer. I watched the video and I think I understood the idea of immutable objects. However, I can not imagine how concurrent modification would work. Can multiple threads modify the model at the same time? If yes: How to avoid parallel versions of the data model? Do you know a sample project (in java) that shows the technique? – QStormDS Nov 16 '13 at 13:42
BTW: I just added a example model. – QStormDS Nov 16 '13 at 13:47
After thinking about it there are still two questions. 1: What do I need to do if I add a new Building to a Colony (see my example model)? Do I need to copy the whole reference structure (relationships) into a new model and add the new Building? 2: Did I understand it correctly, that only one thread can create a new version of the data model at the same time? If not: How to avoid parallel versions, if more than one thread can create a new version of the model at the same time? – QStormDS Nov 16 '13 at 15:49
@QStormDS see for example [CopyOnWriteArrayList](http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/CopyOnWriteArrayList.html). So that is a "yes" on question 1), but that class makes the copy for you. For question 2), if two threads try to modify the same component at the same time, only one will get the lock first (there is a write-lock on CopyOnWriteArrayList) and the other will modify it after the first change. – toto2 Nov 16 '13 at 16:47
However, another thread that only reads and iterates on the CopyOnWriteArrayList might iterate on "old" data. It is not quite a bug since it allows for much better performance. But maybe you can't allow that behavior for your problem. – toto2 Nov 16 '13 at 16:49
@toto2: Copy the whole reference structure (relationships) might be easy in case of a collection (or homogen data structure). However, how should that work in the example model (see my original post)? Several tasks (put Ware into Stock or add Colony to Character) would need to have more or less redundant copy procedures. – QStormDS Nov 16 '13 at 17:43
You would not have to do anything special if you use CopyOnWriteArrayList for each set of members in any item. If you add one member somewhere in the hierarchy, only one CopyOnWriteArrayList is affected and the higher ups in the hierarchy do not change. I'm sure CopyOnWriteArrayList forces a cache synchronization so all copies of the data model on all threads would get the update. – toto2 Nov 16 '13 at 18:10
@QStormDS hey, I've updated my post for your data model. hope it helps! – Andrey Chaschev Nov 16 '13 at 18:54
I did some testing with the CopyOnWriteArrayList and the performance is very low and the memeory usage extremly high. – QStormDS Nov 16 '13 at 21:08
@mikera: Can you please say something about bullet item 4 ("you can make a new version of the data model")? How could I do this after adding a Building to a Colony (see my data model). How to re-create the whole data model? There are several Character, thousands of Colonies and even more Buildings in the data model. – QStormDS Nov 16 '13 at 21:28
@QStormDS The performance of CopyOnWriteArrayList is obviously poor on write. Be careful how you test performance: you should simulate the proportion of reads/writes that you would get in real conditions. – toto2 Nov 17 '13 at 13:55
CopyOnWriteArrayList does not provide *structural sharing* (it creates new copies of the whole structure on write), so it isn't really suitable for this kind of technique. You need something like Clojure's persistent vectors (as described in the linked video). See also: http://en.wikipedia.org/wiki/Persistent_data_structure – mikera Nov 17 '13 at 14:09

score 1 · Answer 2 · edited May 23 '17 at 11:56

a) * This is solution which was used in a real social game * If you can think of a key for your objects or proper equals/hashCode functions, you can put them into ConcurrentHashMap. Each present entity in this map would mean locked state for an object. This will result into 40 bytes per entity overhead.

b) You can optimize previous solution and come up with another hash function which would split all your objects into buckets of some reasonable size, i.e. 100 elements (you could measure the needed amount by running tests). In this case a whole bucket would be locked and it would save you some extra bytes. This would result into ~12 bytes overhead per entity to store elements in buckets (i.e. in an ArrayList).

c) Third option, AtomicBitSet implementation for java. This is a modification of a second approach. Buckets can be locked via a compact atomic set. This would be a little better than the second option, and the advantage of this one is that you can have smaller buckets as they require less memory (~40 bytes per bucket in a ConcurrentHashMap vs a couple of bits per bucket in an AtomicBitSet).

Locks

A state might be more complex than just locked/not locked. So instead of maintaining a map:

 lock map: objectId -> {true | false}

Or

 lock map: bucket of objectIds -> {true | false}

One could store lock information:

 lock map: objectId -> {ReadWriteLock lock, Thread owner, long writeLockGrantedAtMs}

If there is no object in this map, then no one locks it. In other case, the object is locked with a lock strategy described by ReadWriteLock. writeLockedAtMs could be used to interrupt the owner if he's holding it for too long.

ADDED

I'm not sure you need this, but deadlocks can be completely avoided if you implement atomic lock for your entities and re-order them i.e. by a hashCode when locking. This can be done by sequentially applying locks to each of the objects with a timeout. Simplified pseudocode:

void lockObjects (f, e, a) {

    reorder (f, e, a)

    if(!tryLock(a, timeout: 10ms)){
        throw "could not lock a";
    }

    if(!tryLock(e, timeout: 10ms)){
        throw "could not lock e";
    }

    if(!tryLock(f, timeout: 10ms)){
        throw "could not lock f";
    }

    // now these objects are locked, deadlocks avoided
}

UPDATE for the data model

I actually implemented structure a) for 3 social games which were running in production for 1-2 years. The resulting solution was a bit more complex and included persistence, monitoring and deadlock resolving, but this was a requirement and is not very needed.

For example, if you want to add a Colony to a Character, you would make a lock for a character. And you should make sure you always lock your object / there is no other way to get your object than by obtaining a lock.

If you want to add a Colony to six Characters, you could do this non-atomically, i.e. sequentially add Colony to each Character (each addition being atomic) or implement atomic lock and lock all seven objects. The difference can be noticed if there are some problems with locks - in the first case you would get a bigger delay, in the second case you might get a partial update.

Hey, thanks a lot (also for updating for my data model). I will give it a trial. — QStormDS, Nov 16 '13 at 20:16
One further question: How to ensure read lock with this solution? Imagine I would like to display the Character details, Colonies and the Buildings in a GUI. In that case I would need to look for each object if it is locked. But how to react if an object is locked? Example: Character ch = model.getPlayer(); for (Colony col : ch.getColonies()) { // how to handle locks here? } — QStormDS, Nov 16 '13 at 21:32
Added 'Locks' section - basically you can just store a `ReadWriteLock` for each of your *locked* objects. — Andrey Chaschev, Nov 16 '13 at 21:54

score 0 · Answer 3 · answered Nov 16 '13 at 14:59

0

Write some unit tests. For the model, start with high-level locks, like synchronized methods. If some of the methods are just modifying maps, but not objects within, consider using ConcurrentHashMaps for those instead of synchronized methods.

answered Nov 16 '13 at 14:59

brettw

10,664
2
42
59

score 0 · Answer 4 · answered Nov 16 '13 at 17:01

Variants 2 and 3 reduce parallelism to zero (only one thread can access data each moment). Variant 1 maximizes parallelism (indeed one lock per object, not per class). Synchronization efforts are minimal in this case (low contention). The memory for locks is shadowed by existence of 75 millions of background objects. Deadlocks can (and should) be avoided by careful synchronization scheduling (avoid cycles on resource graph).

Trying · Answer 5 · 2013-11-16T19:11:27.793

I would say first very important point is to make your classes immutable as much as possible. With immutable thread safety comes free. You can share your object with out synchronization, JMM will take care of safe publication i.e. you can over come the visibility issue without doing synchronization.
As soon as you start thinking of immutable objects think about Flyweight pattern which help you in reducing the memory foot print in object creation and also with this performance also increases. Because as you know your classes are immutable and only one object is present of one type you can cache lot of information like hashCode of an object and also can be calculated lazily.
Instead of using synchronization block you can go for tryLock from Lock interface, which helps you in preventing the deadlock.
You can use ReadWriteLock also. For read use Read Lock and for write use write lock.
You can use global ordering to get rid of deadlock. Suppose you have a method like this:

public void fun(MyClass1 o1, MyClass2 o2)
{
    synchronized(o1)
      {
        synchronized(o2){
            .........
            .........
            .........
      }
}

here there is a very well possibility of deadlock so here you can use global lock ordering like:

public void fun(MyClass1 o1, MyClass2 o2)
{
    long l1 = System.identityHashCode(o1);
    long l2 = System.identityHashCode(o2);
    if(l1>l2){
      synchronized(o1)
      {
        synchronized(o2)
        {
            .........
            .........
            .........
       }
    }
   }
   else if (l1<l2)
   {
      synchronized(o2)
      {
        synchronized(o1)
        {
            .........
            .........
            .........
       }
    }
   }
 else//if equal than resolve by another mutex
 {
     synchronize(o)
     {
          synchronize(o1)
          {
              synchronize(o2)
              {
                  .........
                  .........
                  .........

              }
          }
      }
 }
}

Calculate identityHashCode if long1 > long2 then one type of ordering, if long2 > long1 than another type of ordering and if there is a tie then use a mutex o to resolve. This helps in great deal in avoiding deadlocks.

you can use concurrent data structure like ConcurrentLinkedQueue which uses compareAndSwap machine instruction which are non-blocking and lock free as well.
consider using ConcurrentHashMap which gives better scalability in multithreading environment as it uses lock striping.
While using Lock interface make sure you use non-fair lock as it scales well and gives better performance in real world. Reason: suppose you think a thread A just finishes a job and then the next thread B is ready to be given the CPU. Before it is given the cpu the data should be brought to cache, state should be changed and a lot of things to do. If at that time a thread C comes for cpu and if you have non fair lock than the new thread C will be given the cpu and it increases the overall system performance. Because till the time B has finished waking up it may be possible this new thread C would have finished it's task. It is win for all.

Thread safety in huge data model

5 Answers5