14

Here is my problem: I have a sequence S of (nonempty but possibly not distinct) sets s_i, and for each s_i I need to know how many sets s_j in S (i ≠ j) are subsets of s_i.

I also need incremental performance: once I have all my counts, I may replace one set s_i by some subset of s_i and update the counts incrementally.

Performing all this using purely functional code would be a huge plus (I code in Scala).
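To pin down the spec, here is a naive quadratic version of the counts I am after (subsetCounts is just an illustrative name, operating on plain Scala Sets):

// Naive O(n²·d) reference: for each s_i, count the other sets that are subsets of it.
def subsetCounts[A](s: Seq[Set[A]]): Seq[Int] =
  s.zipWithIndex.map { case (si, i) =>
    s.zipWithIndex.count { case (sj, j) => i != j && sj.subsetOf(si) }
  }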

Since set inclusion is a partial ordering, I thought the best way to solve my problem would be to build a DAG representing the Hasse diagram of the sets, with edges representing inclusion, and to attach to each node an integer equal to the size of the sub-DAG below it, plus 1. However, I have been stuck for several days trying to develop the algorithm that builds the Hasse diagram from the partial ordering (let's not talk about incrementality!), even though I thought it would be standard undergraduate material.

Here is my data structure:

case class HNode[A] (
  val v: A,
  val child: List[HNode[A]]) {
  val rank = 1 + child.map(_.rank).sum
}

My DAG is defined by a list of roots and some partial ordering:

class Hasse[A](val po: PartialOrdering[A], val roots: List[HNode[A]]) {
  def +(v: A): Hasse[A] = new Hasse[A](po, add(v, roots))

  private def collect(v: A, roots: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
    if (roots == Nil) collected
    else {
      val (subsets, remaining) = roots.partition(r => po.lteq(r.v, v))
      collect(v, remaining.map(_.child).flatten, subsets.filter(r => !collected.exists(c => po.lteq(r.v, c.v))) ::: collected)
    }
}

I am pretty stuck here. The latest approach I came up with to add a new value v to the DAG is:

  1. find all "root subsets" rs_i of v in the DAG, i.e., subsets of v such that no strict superset of rs_i is itself a subset of v. This can be done quite easily by performing a search (BFS or DFS) on the graph (the collect function above, possibly non-optimal or even flawed).
  2. build the new node n_v, the children of which are the previously found rs_i.
  3. Now, let's find out where n_v should be attached: for a given list of roots, find the supersets of v. If none are found, add n_v to the roots and remove the subsets of n_v from the roots. Otherwise, perform step 3 recursively on the supersets' children (a rough sketch follows).
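Roughly, steps 2 and 3 would look something like this (untested sketch, reusing the collect function above and living in the same Hasse class):

  private def add(v: A, roots: List[HNode[A]]): List[HNode[A]] = {
    val nv = HNode(v, collect(v, roots, Nil)) // steps 1 and 2
    attach(nv, roots)                         // step 3
  }

  private def attach(nv: HNode[A], roots: List[HNode[A]]): List[HNode[A]] = {
    val supersets = roots.filter(r => po.lteq(nv.v, r.v))
    if (supersets.isEmpty)
      // nv becomes a root; its subsets are no longer roots themselves
      nv :: roots.filterNot(r => po.lteq(r.v, nv.v))
    else
      // push nv below every root that contains it (naive about equal elements)
      roots.map(r => if (po.lteq(nv.v, r.v)) HNode(r.v, attach(nv, r.child)) else r)
  }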

I have not yet fully implemented this algorithm, but it seems unnecessarily convoluted and suboptimal for my apparently simple problem. Is there some simpler algorithm available (Google was clueless on this)?

scand1sk
  • That algorithm seems awfully simple to me, not unnecessarily convoluted. What exactly is the problem? The Scala code for it will barely be longer than your description. (Though I don't think you've even quite described it fully.) – Rex Kerr Jan 12 '12 at 19:58
  • Well, since I got into functional programming (~6 months ago), I have become used to one-liners when dealing with recursive data structures. It feels awkward to develop a three-step algorithm that does not lie in a single recursive call (step 1 is disconnected from step 3). Also, this algorithm checks for subsets twice (steps 1 and 3), which feels wrong. – scand1sk Jan 12 '12 at 23:46
  • As a reference, I recently implemented a binomial heap, which felt much easier (although that is probably because the algorithms were better defined). – scand1sk Jan 13 '12 at 00:01
  • You have two inherently different things to do: add the new set as a root node, if appropriate, and stick it into the list of children and build the appropriate child lists (at least one thing, probably two). Getting all that in one line of reasonable length seems awfully optimistic. – Rex Kerr Jan 13 '12 at 10:59
  • Actually, I managed to do it in an earlier, flawed analysis where I assumed the partial ordering would lead to a tree. I thought that replacing the tree with a DAG would be easy; damn, I was wrong: a partial ordering means that subsets of my new element can appear anywhere in the DAG, not only in one particular subtree. – scand1sk Jan 14 '12 at 11:32
  • They can appear anywhere, but only where there is at least an element in common. So you don't have to check _everything_, just whether you are disjoint with your root elements. (If you are nondisjoint with a root, you must check all children you are not disjoint with.) – Rex Kerr Jan 14 '12 at 11:41
  • Yes, I found this out, but then I have to drop my generalization to PartialOrdering (or extend PartialOrdering to add some "disjoint-like" method). – scand1sk Jan 14 '12 at 12:43
  • I know it's an old post, but if someone comes across this it seems to me that the downward closure property of the Apriori algorithm might be applicable: http://en.wikipedia.org/wiki/Apriori_algorithm – Hans Jan 29 '15 at 01:52

2 Answers

2

After some work, I finally ended up solving my problem, following my initial intuition. The collect method and the rank evaluation were flawed; I rewrote them, with tail recursion as a bonus. Here is the code I obtained:

import scala.annotation.tailrec

final case class HNode[A](
  val v: A,
  val child: List[HNode[A]]) {
  // rank = 1 + number of distinct nodes reachable below this node (shared nodes counted once)
  val rank: Int = 1 + count(child, Set.empty)

  @tailrec
  private def count(stack: List[HNode[A]], c: Set[HNode[A]]): Int =
    if (stack == Nil) c.size
    else {
      val head :: rem = stack
      if (c(head)) count(rem, c)
      else count(head.child ::: rem, c + head)
    }
}

// ...

  // Build the node for v with its maximal subsets as children, then attach it under its supersets.
  private def add(v: A, roots: List[HNode[A]]): List[HNode[A]] = {
    val newNode = HNode(v, collect(v, roots, Nil))
    attach(newNode, roots)
  }

  private def attach(n: HNode[A], roots: List[HNode[A]]): List[HNode[A]] =
    if (roots.contains(n)) roots
    else {
      val (supersets, remaining) = roots.partition { r =>
        // Strict superset to avoid creating cycles in case of equal elements
        po.tryCompare(n.v, r.v) == Some(-1)
      }
      if (supersets.isEmpty) n :: remaining.filter(r => !po.lteq(r.v, n.v))
      else {
        supersets.map(s => HNode(s.v, attach(n, s.child))) ::: remaining
      }
    }

  // Collect the maximal nodes (w.r.t. po) whose value is <= v: these become the children of v's node.
  @tailrec
  private def collect(v: A, stack: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
    if (stack == Nil) collected
    else {
      val head :: tail = stack

      if (collected.exists(c => po.lteq(head.v, c.v))) collect(v, tail, collected)
      else if (po.lteq(head.v, v)) collect(v, tail, head :: (collected.filter(c => !po.lteq(c.v, head.v))))
      else collect(v, head.child ::: tail, collected)
    }
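For example, building the diagram over plain Set[Int] values with the subset ordering looks roughly like this (the names below are only illustrative):

// Subset partial ordering on Set[Int]; lteq and tryCompare are the abstract members.
val subsetPO = new PartialOrdering[Set[Int]] {
  def lteq(x: Set[Int], y: Set[Int]): Boolean = x.subsetOf(y)
  def tryCompare(x: Set[Int], y: Set[Int]): Option[Int] =
    if (x == y) Some(0)
    else if (x.subsetOf(y)) Some(-1)
    else if (y.subsetOf(x)) Some(1)
    else None
}

val sets = List(Set(1, 2, 3), Set(1, 2), Set(2), Set(2, 3))
val hasse = sets.foldLeft(new Hasse[Set[Int]](subsetPO, Nil))(_ + _)
// For any node n, n.rank - 1 is the number of distinct nodes strictly below n in the diagram.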

I now must check some optimizations:

  - cut off branches whose sets are disjoint from the new set when collecting subsets (as Rex Kerr suggested)
  - see if sorting the sets by size improves the process (as mitchus suggested)

The next problem is to work out the (worst-case) complexity of the add() operation. With n the number of sets and d the size of the largest set, the complexity will probably be O(n²d), but I hope it can be refined. Here is my reasoning: if no set is a subset of another, the DAG degenerates into a flat list of nodes that are both roots and leaves. Thus, every time I add a node to the data structure, I still have to check it for inclusion against each node already present (in both the collect and attach procedures). This leads to 1 + 2 + … + n = n(n+1)/2 ∈ O(n²) inclusion checks.

Each set inclusion test is O(d), hence the result.

scand1sk
  • Some simple benchmarks with randomly generated sets tend to confirm the O(n²d) complexity even in the average case. – scand1sk Jan 19 '12 at 16:23
  • The above code has a bug: the creation of new HNodes in the attach procedure splits nodes in the DAG. I am working on this. – scand1sk Jan 23 '12 at 14:02
0

Suppose your DAG G contains a node v for each set, with attributes v.s (the set) and v.count (the number of instances of the set), including a node G.root with G.root.s = union of all sets (where G.root.count=0 if this set never occurs in your collection).

Then, to count how many sets in your collection are subsets of s (summing multiplicities via count), you could do the following (in a bastardized mixture of Scala, Python and pseudo-code):

sum(map(lambda x: x.count, get_subsets(s, G.root)))

where

get_subsets(s, v):
   if(v.s is not a subset of s, {},
      union({v} :: map(lambda x: get_subsets(s, x), v.children)))
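In Scala, the same idea could look roughly like this (the node type below is only illustrative; note that the recursion keeps descending through nodes whose set is not itself a subset of s, since such a node can still have descendants that are):

// Illustrative node type with the attributes described above (s, count, children).
case class CNode[A](s: Set[A], count: Int, children: List[CNode[A]])

// All nodes of the DAG whose set is a subset of s, collected in a Set so that
// nodes reachable through several parents are counted only once.
def getSubsets[A](s: Set[A], v: CNode[A]): Set[CNode[A]] = {
  val below = v.children.foldLeft(Set.empty[CNode[A]]) { (acc, c) => acc ++ getSubsets(s, c) }
  if (v.s.subsetOf(s)) below + v else below
}

def countSubsets[A](s: Set[A], root: CNode[A]): Int =
  getSubsets(s, root).toList.map(_.count).sum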

In my opinion though, for performance reasons you would be better off abandoning this kind of purely functional solution... it works well on lists and trees, but beyond that the going gets tough.

mitchus
  • This answer assumes that the DAG exists, doesn't it? My first problem is to generate the DAG from the partial order. After some further research, it seems that I want to compute a transitive reduction (the inverse of a transitive closure), and it may be related to topological sorting. – scand1sk Jan 18 '12 at 13:38
  • Well, actually all that I have is the partial order. At the root of my problem, I do not have v.children. I want to find the children as efficiently as possible (I hope for better than O(n²)). – scand1sk Jan 18 '12 at 13:53
  • Yes indeed, here I suppose that the DAG already exists. To build it, as a first step you can sort the sets by size; a subset is always smaller than a superset. As a next step, I would build an artificial root node with set = union of all sets. Then the idea becomes: take the sets in decreasing size order, create a node for each, and decide which are its "minimal" supersets; you want to link to those and only those. Start at the root node, and iteratively descend into all nodes that are supersets, until you reach those "minimal" supersets; each time you reach such a superset, add a link. – mitchus Jan 19 '12 at 12:39