3

How it goes: Based on the set of tuple (id, x, y), find the min max for x and y , then two dots (red points) created. Each element in tuple are grouped to two groups based on the distance towards the red dots.

phase 1

Each group cant exceed 5 dots. If exceed, new group should be computed. I've managed to do recursion for the first phase. But I have no idea how to do it for second phase. The second phase should look like this:

phase 2

Based on these two groups, again it need to find the min max for x and y (for each group), then four dots (red points) created. Each element in tuple are grouped to two groups based on the distance towards the red dots.

getDistance :: (Int, Double, Double) -> (Int, Double, Double) -> Double
getDistance (_,x1,y1) (_,x2,y2) = sqrt $ (x1-x2)^2 + (y1-y2)^2
getTheClusterID :: (Int, Double, Double) -> Int
getTheClusterID  (id, _, _) = id

idxy = [(id, x, y)]
createCluster id cs = [(id, minX, minY),(id+1, maxX, minY), (id+2, minX, maxY), (id+3, maxX, maxY)]
                        where minX = minimum $ map (\(_,x,_,_) -> x) cs
                              maxX = maximum $ map (\(_,x,_,_) -> x) cs
                              minY = minimum $ map (\(_,_,y,_) -> y) cs
                              maxY = maximum $ map (\(_,_,y,_) -> y) cs
idCluster = [1]
cluster = createCluster (last idCluster) idxy

clusterThis (id,a,b) = case (a,b) of
  j | getDistance (a,b) (cluster!!0) < getDistance (a,b) (cluster!!1) &&
        -> (getTheClusterID (cluster!!0), a, b) 
  j | getDistance (a,b) (cluster!!1) < getDistance (a,b) (cluster!!0) &&
        -> (getTheClusterID (cluster!!1), a, b)
  _ -> (getTheClusterID (cluster!!0), a, b)

groupAll = map clusterThis idxy

I am moving from imperative to functional. Sorry if my way of thinking is still in imperative way. Still learning.

EDIT: To clarify, this is the original data looks like.

Phase 0

Sir DK
  • 179
  • 1
  • 9
  • 4
    This could really use more explicit types and signatures. The pictures are good, though, and the problem is understandable. I think if you expressed your data in the more readable way it would be easier for you to express the solution as well. – Bartek Banachewicz Aug 30 '17 at 11:11
  • @BartekBanachewicz the data is pair of coordinates with and id (id, x, y) - (Int, Double, Double) – Sir DK Aug 30 '17 at 11:15
  • The fact you had to explain it is enough to rethink how you store it. Create distinct types representing your points and your clusters. – Bartek Banachewicz Aug 30 '17 at 11:16
  • Does the points and clusters need to converted to a new list in each phase? – Sir DK Aug 30 '17 at 11:22

2 Answers2

3

The basic principle to follow in writing such an algorithm is to write small, compositional programs; each program is then easy to reason about and test in isolation, and the final program can be written in terms of the smaller ones.

The algorithm can be summarized as follows:

  1. Compute the points which bound the set of points.
  2. Split the rest of the points into two clusters, one containing points closer to the minimum point, the other containing all other points (equivalently, points closer to the maximum point).
  3. If any cluster contains more than 5 points, repeat the process on that cluster.

The presence of a 'repeat the process' step indicates this to be a divide and conquer problem.

I see no need for an ID for each point, so I've dispensed with this.

To begin, define datatypes for each type of data you will be working with:

import Data.List (partition)

data Point = Point { ptX :: Double, ptY :: Double }
data Cluster = Cluster { clusterPts :: [Point] }

This may seem silly for such simple data, but it can potentially save you quite a bit of confusion during debugging. Also note the import of a function we will be using later.

The 1st step:

minMaxPoints :: [Point] -> (Point, Point)
minMaxPoints ps = 
   (Point minX minY
   ,Point maxX maxY)
     where minX = minimum $ map ptX ps
           maxX = maximum $ map ptX ps
           minY = minimum $ map ptY ps
           maxY = maximum $ map ptY ps

This is essentially the same as your createCluster function.

The 2nd step:

pointDistance :: Point -> Point -> Double
pointDistance (Point x1 y1) (Point x2 y2) = sqrt $ (x1-x2)^2 + (y1-y2)^2

cluster1 :: [Point] -> [Cluster]
cluster1 ps =
  let (mn, mx) = minMaxPoints ps
      (psmn, psmx) = partition (\p -> pointDistance mn p < pointDistance mx p) ps
  in [ Cluster psmn, Cluster psmx ]

This function should clear - it is a direct translation of the above statement of this step into code. The partition function takes a predicate and a list and produces two lists, the first containing all elements for which the predicate is true, and the second all elements for which it is false. pointDistance is essentially the same as your getDistance function.

The 3rd step:

cluster :: [Point] -> [Cluster]
cluster ps =
  cluster1 ps >>= \cl@(Cluster c) ->
  if length c > 5
  then cluster c
  else [cl]

This also implements the statement above very directly. Perhaps the only confusing part is the use of >>=, which (here) has type [a] -> (a -> [b]) -> [b]; it simply applies the given function to each element of the given list, and concatenates the result (equivalently, it is written flip concatMap).

Finally your test case (which I hope I've translated correctly from pictures to Haskell data):

testPts :: [Point]
testPts = map (uncurry Point)
  [ (0,0), (1,0), (2,1), (0,2)
  , (5,2), (5,4), (4,3), (4,4)
  , (8,2), (9,3), (10,2)
  , (11,4), (12,3), (13,3), (13,5) ]

main = mapM_ (print . map (\p -> (ptX p, ptY p)) . clusterPts) $ cluster testPts

Running this program produces

[(0.0,0.0),(0.0,2.0),(2.0,1.0),(1.0,0.0)]
[(4.0,4.0),(5.0,2.0),(5.0,4.0),(4.0,3.0)]
[(10.0,2.0),(9.0,3.0),(8.0,2.0)]
[(13.0,3.0),(12.0,3.0),(11.0,4.0),(13.0,5.0)]
user2407038
  • 14,400
  • 3
  • 29
  • 42
  • I don't think your `minMax` is correct. I'm reading the question as the author wanting a bounding box ("find the min max for `x` and `y`, then two dots (red points) created"), whereas `minMaxPoints` selects the bottom-left-most and top-right-most points from the set. – Benjamin Hodgson Aug 30 '17 at 13:09
  • @BenjaminHodgson the way I'm reading the question, a red point can be part of the original set. The top right one is not really visibile in the picture, because the distance to the created red point is 0. ;-) – Marc Lambrichs Aug 30 '17 at 13:30
  • 2
    A red point _can be_ part of the original set, but need not be. – Benjamin Hodgson Aug 30 '17 at 13:31
  • @BenjaminHodgson So I guess I got confused. I interpreted it precisely as wanting the bottom-left-most and top-right-most points. As in, the red points are strictly members of the input. I'm still unclear whether the red points are part of the cluster proper (i.e. whether they count towards the 5 point maximum if they did not occur in the input set); perhaps this is also clear to you, or the OP can clarify. Thanks for pointing this out. – user2407038 Aug 30 '17 at 14:14
  • 3
    The thing that rang my alarm bell is "bottom-left-most". That's an ill-defined notion in 2D space; not every set of points has a bottom-left-most element (eg `[(1, 0), (0, 1)]`). – Benjamin Hodgson Aug 30 '17 at 14:24
  • I've edited the question by putting in the original data (how it looks like). I'm in the process of digesting your solution @user2407038. Sorry for being that slow ;-)))) I hope that I can be at your level one day and able to help others too. – Sir DK Aug 31 '17 at 07:55
  • @SirDK See the edit (I omitted it originally for brevity). – user2407038 Aug 31 '17 at 08:35
  • @user2407038 Can you elaborate in detail the step 2? Because I couldn't get your points. But if that is the simplest explanation, than its me that have trouble to understand it. Its the cluster1 function, between the let and in. I'm not familiar with this way. By the way the code works fine. Get the answer. I've edited a bit at the end since I compiled it using Leksah. – Sir DK Aug 31 '17 at 11:06
  • 1
    @SirDK Are you asking what a `let .. in ..` blocks means? If so, see [here](https://www.haskell.org/onlinereport/haskell2010/haskellch3.html#x8-440003.12). If you're asking about the use of `partition`, then probably the easiest way to understand it is to play around with some simpler examples. E.g. `partition (< 3) [8,5,3,1,-10,2] == ([1,-10,2], [8,5,3])`. One way to define `partition` is `partition p xs = (filter p xs, filter (not . p) xs)`. Does this help? – user2407038 Aug 31 '17 at 11:15
  • @user2407038 Yes yes yes. That really help. And Ive just noticed about the partition function. Thought that it is a new function. In step 3, there is "cl@(Cluster c)". How does it work? – Sir DK Aug 31 '17 at 11:35
  • @SirDK This is called an "as pattern" (see the section of the report on [pattern matching](https://www.haskell.org/onlinereport/haskell2010/haskellch3.html#x8-580003.17) - in general, I highly recommend you read sections 2-5 for an understanding of the basic syntax). When you write `\cl -> .. :: Cluster -> ..` you have a variable `cl :: Cluster` in scope. When you write `\(Cluster c) -> ..` you have a variable `c :: [Point]` in scope. An as pattern combines both of these - writing `\cl@(Cluster c) -> ..` gives you both `cl::Cluster` and `c::[Point]` in scope of the pattern. – user2407038 Aug 31 '17 at 11:39
  • @user2407038 partition does limit us into two "partitions" right (that fulfill the condition and not)? What if we change the condition from two red dots (bottom left and upper right) to four red dots (bottom left, upper left, bottom right and upper right). Will it still be possible to alter the condition? – Sir DK Aug 31 '17 at 12:19
  • @user2407038 I tried to change the length c to > 6 (which initially set to > 5). And the result is not as it should be. And I'm not sure why. – Sir DK Aug 31 '17 at 12:58
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/153391/discussion-between-user2407038-and-sir-dk). – user2407038 Aug 31 '17 at 16:23
3

Functional programmers love recursion, yet they go to great lengths to avoid writing it. Jeez, people, make up your minds!

I like to structure my code, to the extent possible, using common, well-understood combinators. I want to demonstrate a style of Haskell programming which leans heavily on standard tools to implement the boring parts of a program (mapping, zipping, looping) as tersely and generically as possible, freeing you up to focus on the problem at hand.

So don't worry if you don't understand everything here. I just want to show you what's possible! (And please ask if you have questions!)

Vectors

First things first: we're working with two-dimensional space, so we'll need two-dimensional vectors and some secondary school vector algebra to work with them.

I'm going to parameterise my vector by the scalar on which our vector space is built. This'll allow me to work with standard type classes like Functor, so I can delegate a lot of the work of building a vector algebra to the machine. I've turned on DeriveFunctor and DeriveFoldable, which allow me to utter the magic words deriving (Functor, Foldable).

data Pair a = Pair {
    px :: a,
    py :: a
} deriving (Show, Functor, Foldable)

Hereafter I'm going to avoid working explicitly with Pair, and program to an interface, not an implementation. This'll allow me to build a simple linear algebra library in a manner that's independent of the dimensionality of the vector space. I'll give example type signatures in terms of V2:

type V2 = Pair Double

Scalar multiplication: functors

A vector space is required to have two operations: scalar multiplication and vector addition. Scalar multiplication means multiplying each component of a vector by a constant scalar. If you view a vector as a container of components, it should be clear that this means "do the same thing to every element in a container" - that is, it's a mapping operation. That's what Functor is for.

-- mul :: Double -> V2 -> V2
mul :: (Functor f, Num n) => n -> f n -> f n
mul k f = fmap (k *) f

Vector addition: zippy applicatives

Vector addition involves adding up the components of a vector point-wise. Thinking of a vector as a container of components, addition is a zipping operation - match up each element of the two vectors and add them up.

Applicative functors are functors with an additional "apply" operation. Thinking of a functor f as a container, Applicative's <*> :: f (a -> b) -> f a -> f b gives you a way to take a container of functions and apply it to a container of values to get a new container of values. It should be clear that one way to make Pair into an Applicative is to use zipping to apply functions to values.

instance Applicative Pair where
    pure x = Pair x x
    Pair f g <*> Pair x y = Pair (f x) (g y)

(For another example of a zippy applicative, see this answer of mine.)

Now that we have a way to zip two pairs, we can leverage a bit of standard Applicative machinery to implement vector addition.

-- add :: V2 -> V2 -> V2
add :: (Applicative f, Num n) => f n -> f n -> f n
add = liftA2 (+)

Vector subtraction, which gives you a way to find the distance between two points, is defined in terms of multiplication and addition.

-- minus :: V2 -> V2 -> V2
minus :: (Applicative f, Num n) => f n -> f n -> f n
v `minus` u = v `add` mul (-1) u

Dot products: foldable containers

2D Euclidean space is actually a Hilbert space - a vector space equipped with a way to measure lengths and angles in the form of a dot product. To take the dot product of two vectors, you multiply the components together and then add up the results. Once more, we'll be using Applicative to multiply the components, but that just gives us another vector: how do we implement "adding up the results"?

Foldable is the class of containers which admit an "aggregation" operation foldr :: (a -> b -> b) -> b -> f a -> b. The standard prelude's sum is defined in terms of foldr, so:

-- dot :: V2 -> V2 -> Double
dot :: (Applicative f, Foldable f, Num n) => f n -> f n -> n
v `dot` u = sum $ liftA2 (*) v u

This gives us a way to find the absolute length of a vector: dot it with itself and take the square root.

-- modulus :: V2 -> Double
modulus :: (Applicative f, Foldable f, Floating n) => f n -> n
modulus v = sqrt $ v `dot` v

So the distance between two points is the modulus of the difference of the vectors.

dist :: (Applicative f, Foldable f, Floating n) => f n -> f n -> n
dist v u = modulus (v `minus` u)

N-ary zipping: traversable containers

An axis-aligned (hyper-)rectangle can be defined by just two points. We'll represent the bounding box of a set of points as a Pair of vectors pointing to opposite corners of the bounding box.

Given a collection of vectors of components, we can find the opposite corners of the bounding box by finding the maximum and minimum of each component across the collection. This requires us to zip up, or transpose, a collection of vectors of components into a vector of collections of components. For this I'll use Traversable's sequenceA.

-- boundingBox :: [V2] -> Pair V2
boundingBox :: (Traversable t, Applicative f, Ord n) => t (f n) -> Pair (f n)
boundingBox vs =
    let components = sequenceA vs
    in Pair (minimum <$> components) (maximum <$> components)

Clustering

Now that we have a library for working with vectors, we can get down to the meaty part of the algorithm: dividing sets of points into clusters.

Partitioning

Let me rephrase the specification of the inner loop of your algorithm. You want to partition a set of points based on whether they're closer to the bottom-left corner of the set's bounding box or to the top-right corner. That's what partition does.

We can write a function, whichCluster which uses minus and modulus to decide this for a single point, and then use partition to apply it to the whole set.

type Cluster = []
-- cluster :: Cluster V2 -> [Cluster V2]
cluster :: (Applicative f, Foldable f, Ord n, Floating n) => Cluster (f n) -> [Cluster (f n)]
cluster vs =
    let Pair bottomLeft topRight = boundingBox vs
        whichCluster v = dist v bottomLeft <= dist v topRight
        (g1, g2) = partition whichCluster vs
    in [g1, g2]

Repetition, repetition, repetition

Now we want to repeatedly cluster until we don't have any groups larger than 5. Here's the plan. We'll keep track of two sets of clusters, those which are small enough, and those which require further sub-clustering. I'll use partition to sort a list of clusters into those which are small enough and those which need subclustering. I'll use the list monad's >>= :: [a] -> (a -> [b]) -> [b] (here [Cluster V2] -> ([V2] -> [Cluster V2]) -> [Cluster V2]), which maps a function over a list and flattens the result, to implement the notion of subclustering. And I'll use until to repeatedly subcluster until the set of remaining too-large clusters is empty.

-- smallClusters :: Int -> Cluster V2 -> [Cluster V2]
smallClusters :: (Applicative f, Foldable f, Ord n, Floating n) => Int -> Cluster (f n) -> [Cluster (f n)]
smallClusters maxSize vs = fst $ until (null . snd) splitLarge ([], [vs])
    where
        smallEnough xs = length xs <= maxSize
        splitLarge (small, remaining) =
            let (newSmall, large) = partition smallEnough remaining
            in (small ++ newSmall, large >>= cluster)

A quick test, cribbed from @user2407038's answer:

testPts :: [V2]
testPts = map (uncurry Pair)
    [ (0,0), (1,0), (2,1), (0,2)
    , (5,2), (5,4), (4,3), (4,4)
    , (8,2), (9,3), (10,2)
    , (11,4), (12,3), (13,3), (13,5) ]

ghci> smallClusters 5 testPts
[
    [Pair {px = 0.0, py = 0.0},Pair {px = 1.0, py = 0.0},Pair {px = 2.0, py = 1.0},Pair {px = 0.0, py = 2.0}],
    [Pair {px = 5.0, py = 2.0},Pair {px = 5.0, py = 4.0},Pair {px = 4.0, py = 3.0},Pair {px = 4.0, py = 4.0}],
    [Pair {px = 8.0, py = 2.0},Pair {px = 9.0, py = 3.0},Pair {px = 10.0, py = 2.0}]
    [Pair {px = 11.0, py = 4.0},Pair {px = 12.0, py = 3.0},Pair {px = 13.0, py = 3.0},Pair {px = 13.0, py = 5.0}]
    ]

There you go. Small clusters in n-dimensional space, all without a single recursive function.

Labelling

Part of the point of working with the Applicative and Foldable interfaces, rather than working with V2 directly, was so I could demonstrate the following little magic trick.

Your original code represented points as 3-tuples consisting of two Doubles for the location and an Int for the point's label, but my V2 has no label. Can we recover this? Well, since the code doesn't at any point mention any concrete types - just standard type classes - we can just build a new type for labelled vectors. As long as said type is a Foldable Applicative all of the above code will continue to work without modification!

data Labelled m f a = Labelled m (f a) deriving (Show, Functor, Foldable)

instance (Monoid m, Applicative f) => Applicative (Labelled m f) where
    pure = Labelled mempty . pure
    Labelled m ff <*> Labelled n fx = Labelled (m <> n) (ff <*> fx)

The Monoid constraint is there because when combining actions you also need a way to combine their labels. I'm just going to use First - left-biased choice - because I'm not expecting the points' labels to be relevant to the zipping operations like modulus and boundingBox.

type LabelledV2 = Labelled (First Int) Pair Double

testPts :: [LabelledV2]
testPts = zipWith (Labelled . First . Just) [0..] $ map (uncurry Pair)
    [ (0,0), (1,0), (2,1), (0,2)
    , (5,2), (5,4), (4,3), (4,4)
    , (8,2), (9,3), (10,2)
    , (11,4), (12,3), (13,3), (13,5) ]

ghci> traverse (traverse (getFirst . lbl)) $ smallClusters 5 testPts
Just [[0,1,2,3],[4,5,6,7],[8,9,10],[11,12,13,14]]  -- try reordering testPts
Benjamin Hodgson
  • 42,952
  • 15
  • 108
  • 157
  • Why do you bring up vector spaces? It doesn't seem to make your response any clearer, and you're only really working in R^2 – Brendan Murphy Aug 30 '17 at 17:49
  • 1
    @BrendanMurphy I'm not just working in R^2. The questioneer wants to work with `Labelled` things. Also, if I were only working with R^2 in the form of `(Double, Double)` I'd end up writing by hand all the code the machine wrote for me – Benjamin Hodgson Aug 30 '17 at 17:50