
Say you have a bunch of actions for creating/inserting records into a bunch of different database tables. You have some records which can be inserted without any dependency on the output of any other insert. You have some which need to wait for one other thing to finish. And you have others that need to wait for many things to finish, which might finish at different times in the flow.

How can you write an algorithm which would sort and chunk the actions in the dependency tree so the inserts / database actions can be optimally batched? By optimally batched, I mean if you can insert 10 records into the same table at once, then do that. Any time you can batch insert, you should, to minimize the number of database calls/inserts.

Here is a snippet from the example code I whipped together with a fake sequence of actions using a simple data structure to capture all the required information.

...
{ action: 'create', table: 'tb1', set: 'key12', input: {
  p: { type: 'binding', path: ['key10', 'z'] },
  q: { type: 'binding', path: ['key11', 'a'] }
} },
{ action: 'create', table: 'tb4', set: 'key13' },
{ action: 'create', table: 'tb3', set: 'key14' },
{ action: 'create', table: 'tb4', set: 'key15', input: {
  a: { type: 'binding', path: ['key8', 'z'] },
} },
...

Note that there are 4 possible properties on our "dependency node item":

  • action: This is always "create" in our situation, but potentially could be other things in the future.
  • table: The table name to insert into.
  • set: The name of the variable to add to the shared global "scope" in the action dependency tree, so other actions can read this as input.
  • input: Inputs to the action, which in our case are all "binding" inputs (they could just as well be literal values, but that would be too easy). A binding input reads some property/column value from a record stored in the shared scope of the dependency tree (see the sketch just after this list).
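
Because every binding input names another action's set in path[0], the dependencies of a node can be read straight off its input. Here is a minimal sketch of that idea (a helper I made up for illustration, not part of the example code):

function dependenciesOf(node) {
  // no input means no dependencies
  if (!node.input) return []
  // each binding's path[0] is the `set` name of another action
  return Object.values(node.input)
    .filter(value => value.type === 'binding')
    .map(value => value.path[0])
}

// dependenciesOf({ action: 'create', table: 'tb4', set: 'key15', input: {
//   a: { type: 'binding', path: ['key8', 'z'] },
// } })
// => ['key8']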

Given that, it should be possible to construct a simple algorithm that chunks the actions into subsets which can be run in parallel AND batched. For example, our code below would end up with something like this structure (I manually created this output, so there could be mistakes, though I think I got it right. Also note that while the tables are numbered, that doesn't imply any order to them; they were just simple names to pick):

// this is "close to" the desired result, because
// there might be a slightly different sort applied
// to this when implemented, but the chunking pattern
// to this when implemented, but the chunking pattern
// and grouping of elements should be exactly like this
// (I am pretty sure; I manually did this,
// so there could be small mistakes, but I doubt it)
const closeToDesiredResult = [
  [
    [
      { action: 'create', table: 'tb1', set: 'key1' },
      { action: 'create', table: 'tb1', set: 'key21' },
    ],
    [
      { action: 'create', table: 'tb2', set: 'key2' },
      { action: 'create', table: 'tb2', set: 'key3' },
      { action: 'create', table: 'tb2', set: 'key23' },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key6' },
      { action: 'create', table: 'tb4', set: 'key8' },
      { action: 'create', table: 'tb4', set: 'key13' },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key5' },
      { action: 'create', table: 'tb3', set: 'key7' },
      { action: 'create', table: 'tb3', set: 'key9' },
      { action: 'create', table: 'tb3', set: 'key14' },
      { action: 'create', table: 'tb3', set: 'key24' },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key17' },
    ],
    [
      { action: 'create', table: 'tb5', set: 'key16' },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key4', input: {
        x: { type: 'binding', path: ['key2', 'baz'] }
      } },
    ],
    [
      { action: 'create', table: 'tb3', set: 'key10', input: {
        y: { type: 'binding', path: ['key6', 'foo'] },
        z: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key15', input: {
        a: { type: 'binding', path: ['key8', 'z'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb1', set: 'key12', input: {
        p: { type: 'binding', path: ['key10', 'z'] },
        q: { type: 'binding', path: ['key11', 'a'] }
      } },
    ],
    [
      { action: 'create', table: 'tb4', set: 'key11', input: {
        a: { type: 'binding', path: ['key10', 'z'] },
        b: { type: 'binding', path: ['key1', 'bar'] }
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key18', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
      } },
      { action: 'create', table: 'tb6', set: 'key19', input: {
        m: { type: 'binding', path: ['key4', 'x'] },
        n: { type: 'binding', path: ['key13', 'a'] },
      } },
    ]
  ],
  [
    [
      { action: 'create', table: 'tb2', set: 'key22', input: {
        w: { type: 'binding', path: ['key18', 'm'] },
        x: { type: 'binding', path: ['key17', 'm'] },
      } },
    ],
    [
      { action: 'create', table: 'tb6', set: 'key20', input: {
        m: { type: 'binding', path: ['key18', 'm'] },
        n: { type: 'binding', path: ['key17', 'm'] },
      } },
    ]
  ]
]

Notice how there are 4 top-level chunks in the resulting array. These are the main steps. Then within each step, everything is grouped by table, so they can all be run in parallel, and within each table group, they can all be batch inserted. Boom.
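
To make the intent concrete, this is roughly how I picture such a structure being consumed (a rough sketch only; batchInsert and resolveInputs are hypothetical helpers, and I'm assuming each batch insert returns the created records in order):

async function runChunkedActions(steps, scope = {}) {
  for (const step of steps) {
    // table groups within a step can run in parallel
    await Promise.all(step.map(async tableGroup => {
      const table = tableGroup[0].table
      // every node in the group becomes one row of a single batch insert
      const rows = tableGroup.map(node => resolveInputs(node.input, scope))
      const records = await batchInsert(table, rows)
      // store each created record under its `set` name for later bindings
      tableGroup.forEach((node, i) => { scope[node.set] = records[i] })
    }))
  }
  return scope
}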

How would you implement this? It seems quite tricky for my mind to grasp.

const actionTree = generateActionTree()
const chunkedActionTree = chunkDependencyTree(actionTree)

function chunkDependencyTree(list) {
  const independentOnesMapByTableName = {}
  list.forEach(node => {
    // easy case
    if (!node.input) {
      const group = independentOnesMapByTableName[node.table]
        = independentOnesMapByTableName[node.table] ?? []
      group.push(node)
    } else {
      // I am at a loss for words...
    }
  })
}

function generateActionTree() {
  // this would be constructed through a bunch of real-world
  // functions, queuing up all the actions
  // and pointing outputs to inputs.
  return [
    { action: 'create', table: 'tb1', set: 'key1' },
    { action: 'create', table: 'tb2', set: 'key2' },
    { action: 'create', table: 'tb2', set: 'key3' },
    { action: 'create', table: 'tb3', set: 'key5' },
    { action: 'create', table: 'tb4', set: 'key6' },
    { action: 'create', table: 'tb3', set: 'key7' },
    { action: 'create', table: 'tb4', set: 'key8' },
    { action: 'create', table: 'tb3', set: 'key9' },
    { action: 'create', table: 'tb3', set: 'key10', input: {
      y: { type: 'binding', path: ['key6', 'foo'] },
      z: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key4', input: {
      x: { type: 'binding', path: ['key2', 'baz'] }
    } },
    { action: 'create', table: 'tb4', set: 'key11', input: {
      a: { type: 'binding', path: ['key10', 'z'] },
      b: { type: 'binding', path: ['key1', 'bar'] }
    } },
    { action: 'create', table: 'tb1', set: 'key12', input: {
      p: { type: 'binding', path: ['key10', 'z'] },
      q: { type: 'binding', path: ['key11', 'a'] }
    } },
    { action: 'create', table: 'tb4', set: 'key13' },
    { action: 'create', table: 'tb3', set: 'key14' },
    { action: 'create', table: 'tb4', set: 'key15', input: {
      a: { type: 'binding', path: ['key8', 'z'] },
    } },
    { action: 'create', table: 'tb5', set: 'key16' },
    { action: 'create', table: 'tb6', set: 'key17' },
    { action: 'create', table: 'tb6', set: 'key18', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
    } },
    { action: 'create', table: 'tb6', set: 'key19', input: {
      m: { type: 'binding', path: ['key4', 'x'] },
      n: { type: 'binding', path: ['key13', 'a'] },
    } },
    { action: 'create', table: 'tb6', set: 'key20', input: {
      m: { type: 'binding', path: ['key18', 'm'] },
      n: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb1', set: 'key21' },
    { action: 'create', table: 'tb2', set: 'key22', input: {
      w: { type: 'binding', path: ['key18', 'm'] },
      x: { type: 'binding', path: ['key17', 'm'] },
    } },
    { action: 'create', table: 'tb2', set: 'key23' },
    { action: 'create', table: 'tb3', set: 'key24' },
  ]
}

I think this is roughly topological sorting, but I'm not quite sure how to apply it to this specific situation.

Lance

2 Answers


We have this:

const actionTree = [
    { action: 'create', table: 'tb1', set: 'key1' },
    { action: 'create', table: 'tb2', set: 'key2' },
...
    { action: 'create', table: 'tb3', set: 'key24' },
  ];

We want to fill this:

batches = [];

Assuming that we only need to be sure that all the sets an input depends on have already been inserted, and that the second item of an input's path (the z in ['key8', 'z'], for example) does not affect anything because sets are atomic, all we have to do is:

batches = [];

batched = () => batches.reduce((p,a)=>[...p,...a],[]);
unbatched = () => actionTree.filter(b=>batched().indexOf(b)<0);

nextbatchfilter = (a) => (!("input" in a))||(Object.values(a.input).filter(i=>batched().map(a=>a.set).indexOf(i.path[0])<0).length==0);

while (unbatched().length>0) {
    batches.push(unbatched().filter(nextbatchfilter));
    if (batches[batches.length-1].length==0) {
        console.log("could not build dependency graph with all items, "+unbatched().length.toString()+" items remaining")
        break; // avoid infinite loop in case of impossible tree
    }
}

Here batched() shows which actions have been batched, by flattening out batches, whereas unbatched() is the opposite: it shows which actions still need to be batched. nextbatchfilter picks out, from those not yet batched, the ones that can be batched next. With those it is already possible to do our job.
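
As a quick sanity check, looking only at the set names per batch (before any table grouping), running the above over the question's actionTree should produce something like:

batches.map(b=>b.map(a=>a.set));
// [
//   ['key1','key2','key3','key5','key6','key7','key8','key9','key13','key14','key16','key17','key21','key23','key24'],
//   ['key10','key4','key15'],
//   ['key11','key18','key19'],
//   ['key12','key20','key22']
// ]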

It is possible to modify the code to reduce excessive CPU rework when calculating unbatched and batched, by keeping the intermediate state materialized:

batches = [];

unbatched = Array.from(actionTree);

nextbatchfilter = (a) => (!("input" in a))||(Object.values(a.input).filter(i=>!(batchedsets.has(i.path[0]))).length==0);

batchedsets = new Set();
while (unbatched.length>0) {
    nextbatch = unbatched.filter(nextbatchfilter);
    if (nextbatch.length==0) {
        console.log("could not build dependency graph with all items, "+unbatched.length.toString()+" items remaining")
        break; // avoid infinite loop in case of impossible tree
    }
    unbatched = unbatched.filter(a=>!nextbatchfilter(a));
    batches.push(nextbatch);
    nextbatch.forEach(a=>batchedsets.add(a.set));
}

In both cases, the batches output does not group actions by table. To view it grouped by table, like in the examples, all that's needed is:

batches.map(b=>Array.from(new Set(b.map(a=>a.table))).map(t=>b.filter(a=>a.table==t)));

Optionally it can be constructed with this group-by already in place.
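
For example, a sketch along the lines of the second version, with the per-table grouping done while each batch is built (same logic, just grouped before being pushed):

batchesByTable = [];
remaining = Array.from(actionTree);
donesets = new Set();

while (remaining.length>0) {
    ready = remaining.filter(a=>(!("input" in a))||Object.values(a.input).every(i=>donesets.has(i.path[0])));
    if (ready.length==0) break; // impossible tree, same protection as above
    remaining = remaining.filter(a=>ready.indexOf(a)<0);
    // group this batch by table so each inner array is one batch insert
    byTable = {};
    ready.forEach(a=>(byTable[a.table]=byTable[a.table]??[]).push(a));
    batchesByTable.push(Object.values(byTable));
    ready.forEach(a=>donesets.add(a.set));
}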

EDIT: added infinite loop protection for both solutions
brunoff

Your data structure isn't clear to me. What are the single-letter ids p, q, etc.? I also don't understand the role of tables. You can insert multiple items in the same table in one write, no? I'm assuming these things don't matter in the root sequencing problem.

I'm treating the set field as a "job" and the corresponding keys mentioned in the inputs as dependencies: jobs that must be completed before it.

I don't have time to be thorough here, and I don't have a javascript environment handy, so this is in Python.

Let's extract a dependency graph from the verbose data structure. Then look for "levels." The first level is all nodes with no dependencies. The second is all nodes with dependencies met in any previous level, etc. Rinse and repeat.

Note that, unlike what I was thinking in my note in the comments, this is not a level graph by the traditional definition.

Also, I'm not going to bother with data structures to make this efficient. You could do it in O(n log n) time. My code is O(n^2).

Sorry if I'm misinterpreting your question. Also sorry for untested, possibly buggy implementation here.

from collections import defaultdict

def ExtractGraph(cmds):
  """Gets a dependency graph from verbose command input data."""
  graph = defaultdict(set)
  for cmd in cmds:
    node = cmd['set']
    graph[node].update(set())  # ensure the node appears in the graph even with no dependencies
    inputs = cmd.get('input')
    if inputs:
      for _, val in inputs.items():
        graph[node].add(val['path'][0])
  return graph

def FindSources(graph):
  """Returns the nodes of the given graph having no dependencies."""
  sources = set()
  for node, edges in graph.items():
    if not edges:
      sources.add(node)
  return sources

def GetLevels(dependencies):
  """Returns sequence levels satisfying given dependency graph."""
  sources = FindSources(dependencies)
  level = set(sources)
  done = set(level)
  todos = dependencies.keys() - done
  levels = []
  while level:
    levels.append(level)
    # Next level is jobs that have all dependencies done
    new_level = set()
    # A clever data structure could find the next level in O(k log n)
    # for a level size of k and n jobs. This needs O(n).
    for todo in todos:
      if dependencies[todo].issubset(done):
        new_level.add(todo)
    todos.difference_update(new_level)
    done.update(new_level)
    level = new_level
  return levels

cmds = [
    { 'action' : 'create', 'table' : 'tb1', 'set' : 'key1' },
    { 'action' : 'create', 'table' : 'tb2', 'set' : 'key2' },
    { 'action' : 'create', 'table' : 'tb2', 'set' : 'key3' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key5' },
    { 'action' : 'create', 'table' : 'tb4', 'set' : 'key6' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key7' },
    { 'action' : 'create', 'table' : 'tb4', 'set' : 'key8' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key9' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key10', 'input' : {
      'y' : { 'type' : 'binding', 'path' : ['key6', 'foo'] },
      'z' : { 'type' : 'binding', 'path' : ['key1', 'bar'] }
    } },
    { 'action' : 'create', 'table' : 'tb1', 'set' : 'key4', 'input' : {
      'x' : { 'type' : 'binding', 'path' : ['key2', 'baz'] }
    } },
    { 'action' : 'create', 'table' : 'tb4', 'set' : 'key11', 'input' : {
      'a' : { 'type' : 'binding', 'path' : ['key10', 'z'] },
      'b' : { 'type' : 'binding', 'path' : ['key1', 'bar'] }
    } },
    { 'action' : 'create', 'table' : 'tb1', 'set' : 'key12', 'input' : {
      'p' : { 'type' : 'binding', 'path' : ['key10', 'z'] },
      'q' : { 'type' : 'binding', 'path' : ['key11', 'a'] }
    } },
    { 'action' : 'create', 'table' : 'tb4', 'set' : 'key13' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key14' },
    { 'action' : 'create', 'table' : 'tb4', 'set' : 'key15', 'input' : {
      'a' : { 'type' : 'binding', 'path' : ['key8', 'z'] },
    } },
    { 'action' : 'create', 'table' : 'tb5', 'set' : 'key16' },
    { 'action' : 'create', 'table' : 'tb6', 'set' : 'key17' },
    { 'action' : 'create', 'table' : 'tb6', 'set' : 'key18', 'input' : {
      'm' : { 'type' : 'binding', 'path' : ['key4', 'x'] },
    } },
    { 'action' : 'create', 'table' : 'tb6', 'set' : 'key19', 'input' : {
      'm' : { 'type' : 'binding', 'path' : ['key4', 'x'] },
      'n' : { 'type' : 'binding', 'path' : ['key13', 'a'] },
    } },
    { 'action' : 'create', 'table' : 'tb6', 'set' : 'key20', 'input' : {
      'm' : { 'type' : 'binding', 'path' : ['key18', 'm'] },
      'n' : { 'type' : 'binding', 'path' : ['key17', 'm'] },
    } },
    { 'action' : 'create', 'table' : 'tb1', 'set' : 'key21' },
    { 'action' : 'create', 'table' : 'tb2', 'set' : 'key22', 'input' : {
      'w' : { 'type' : 'binding', 'path' : ['key18', 'm'] },
      'x' : { 'type' : 'binding', 'path' : ['key17', 'm'] },
    } },
    { 'action' : 'create', 'table' : 'tb2', 'set' : 'key23' },
    { 'action' : 'create', 'table' : 'tb3', 'set' : 'key24' },
]
    
dependencies = ExtractGraph(cmds)
levels = GetLevels(dependencies)
print(levels)

When run, this finds a few levels:

[
{'key9', 'key3', 'key24', 'key7', 'key17', 'key8', 'key21', 'key1',
     'key5', 'key2', 'key16', 'key6', 'key23', 'key13', 'key14'}, 
{'key15', 'key4', 'key10'}, 
{'key19', 'key18', 'key11'}, 
{'key22', 'key12', 'key20'}
]

For a spot check, let's look at key12. It has 10 and 11 as dependencies. key10 has 6 and 1. key11 has 10 and 1. keys 1 and 6 have none. We find

  • 1 and 6 (no dependencies) in level 0,
  • 10 (needs 1 and 6) in level 1,
  • 11 (needs 10 and 1) in level 2,
  • 12 (needs 10 and 11) in level 3.

Each job is being done as soon as its dependencies are met. So that's encouraging. More thorough testing is a must, however.

If you need to group these levels further into separate inserts per table, that's a post-processing step. The resulting inserts within a level can be done in parallel.
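
On the JavaScript side (which, again, I don't have handy to test), that grouping step might look roughly like this, assuming levels has been translated into an array of arrays of set names and actionTree is the original list:

// map each level of set names back to its action objects, then group by
// table so every inner array can become one batch insert
const bySet = Object.fromEntries(actionTree.map(a => [a.set, a]));
const grouped = levels.map(level => {
  const byTable = {};
  for (const name of level) {
    const action = bySet[name];
    (byTable[action.table] = byTable[action.table] ?? []).push(action);
  }
  return Object.values(byTable);
});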

Gene
  • The letters like `p` and `q` don't really matter, sorry for the confusion, those are just pretend nested properties you might read from. The `set` is the output of applying the action, where the record is saved (under the variable in `set`). That way you can access it via whatever it was named via `set` later on. The inputs read values from looking at the `set` names as inputs. – Lance Jan 19 '22 at 22:14
  • Would you mind explaining the python intricacies, I have never used python so it will take a while to unpack the graph library code and how it works. Maybe commenting on the lines would help. How would it translate to a JS graph library like say [this one](https://github.com/datastructures-js/graph)? – Lance Jan 19 '22 at 22:28
  • [Here is my attempt](https://gist.github.com/lancejpollard/56653bcca565303383aae7e80a949cb9) at converting to JavaScript, stuck on a few points if you could clarify what they mean in python and how to translate. _(That JS graph library I chose is not very great lol)_ – Lance Jan 19 '22 at 22:48
  • I'm not using a graph library. Just maps. This code should translate pretty much line for line to js. – Gene Jan 20 '22 at 00:44
  • Ok it looks like I am getting the same output as you now (if you run this in Node.js). https://gist.github.com/lancejpollard/eadda9c3773b67237b203c33e8bb68b7 – Lance Jan 20 '22 at 02:27
  • Well done! I hope it works for you. – Gene Jan 20 '22 at 05:10