I come from an imperative background and am trying to implement a simple disjoint sets (“union-find”) data structure to get some practice with creating and modifying (persistent) data structures in Haskell. The goal is to have a simple implementation, but I am also concerned about efficiency, and my question is related to this.
First, I created a disjoint-set forest implementation with union by rank and started by defining a data type for a “point”:
data Point = Point
{ _value :: Int
, _parent :: Maybe Point
, _rank :: Int
} deriving Show
A disjointed set forest is an IntMap
with Int → Point
mappings:
type DSForest = IntMap Point
empty :: DSForest
empty = I.empty
A singleton set is simply a mapping from its value x to a Point with value x, no parent and a rank of 1:
makeSet :: DSForest -> Int -> DSForest
makeSet dsf x = I.insert x (Point x Nothing 0) dsf
Now, the interesting part – union
. This operation will modify a point by setting the other point as its parent (and in some cases change its rank). In the case where the Point
s' rank are different, the Point
is simply “updated” (a new Point is created) to have its parent point to the other. In the case where they are equal, a new Point
is created with its rank increased by one:
union :: DSForest -> Int -> Int -> DSForest
union dsf x y | x == y = dsf
union dsf x y =
if _value x' == _value y'
then dsf
else case compare (_rank x') (_rank y') of
GT -> I.insert (_value y') y'{ _parent = Just x' } dsf
LT -> I.insert (_value x') x'{ _parent = Just y' } dsf
-- 1) increase x's rank by one:
EQ -> let x'' = x'{ _rank = _rank x' + 1 }
-- 2) update the value for x's rank to point to the new x:
dsf' = I.insert (_value x'') x'' dsf
-- 3) then update y to have the new x as its parent:
in I.insert (_value y') y'{ _parent = Just x'' } dsf'
where x' = dsf ! findSet dsf x
y' = dsf ! findSet dsf y
Now, to my real question, if in the EQ
case I had instead done the following:
EQ -> let dsf' = I.insert (_value x') x'{ _rank = _rank x' + 1} dsf
in I.insert (_value y') y'{ _parent = Just x'{ _rank = _rank x' + 1 }} dsf'
I.e. first insert a new Point
x with its rank increased, and then having y'
's parent be a new Point
x with its rank increased, would this mean that they no longer point to the same Point
in memory? (Does this even matter? Should I worry about these things when using/creating persistent data structures?)
And just for completeness, here is findSet
:
findSet :: DSForest -> Int -> Int
findSet dsf' x' = case _parent (dsf' ! x') of
Just (Point v _ _) -> findSet dsf' v
Nothing -> x'
(General comments about the efficiency and design of this code are also welcome.)