Haskell - split a list into two sublists with closest sums

Question

I'm a Haskell beginner trying to learn more about the language by solving some online quizzes/problem sets.

The problem/question is quite lengthy but a part of it requires code that can find the number which divides a given list into two (nearly) equal (by sum) sub-lists.

Given [1..10]

Answer should be 7 since 1+2+..7 = 28 & 8+9+10 = 27

This is the way I implemented it

-- partitions list by y
partishner :: (Floating a) => Int -> [a] -> [[[a]]]
partishner 0 xs = [[xs],[]]
partishner y xs = [take y xs : [drop y xs]] ++ partishner (y - 1) xs


-- finds the equal sum
findTheEquilizer :: (Ord a, Floating a) => [a] -> [[a]]
findTheEquilizer xs = fst $ minimumBy (comparing snd) zipParty
  where party = (tail . init) (partishner (length xs) xs) -- removes [xs,[]] types
        afterParty = (map (\[x, y] -> (x - y) ** 2) . init . map (map sum)) party
        zipParty = zip party afterParty -- zips partitions and squared diff betn their sums

Given (last . head) (findTheEquilizer [1..10]) output : 7

For numbers near 50k it works fine

λ> (last . head) (findTheEquilizer [1..10000])                                                   
   7071.0

The trouble starts when I put in lists with any more than 70k elements in it. It takes forever to compute.

So what do I have to change in the code to make it run better or do I have to change my whole approach? I'm guessing it's the later, but I'm not sure how to go about do that.

What exactly are `partishner`, etc. doing? One can here basically solve the problem in two passes: first calcuate the sum of the list, then each time iterate over an element add the element to the sum on the left, and subtract it at the sum of the right, and keep track of what the optimial delta is. — Willem Van Onsem, Nov 07 '18 at 14:16
If your lists are always of `[1..n]` type then this is just a simple math problem involving quadratic formula. — Redu, Nov 07 '18 at 15:07
@Redu No that's just an example. I should've made it clear in my question. The input could have any number of variation, but they'll always be ordered. — , Nov 07 '18 at 15:30

score 2 · Accepted Answer · answered Nov 07 '18 at 14:57

It looks to me that the implementation is quite chaotic. For example partishner seems to construct a list of lists of lists of a, where, given I understood it correctly, the outer list contains lists with each two elements: the list of elements on "the left", and the list of elements at the "right". As a result, this will take O(n²) to construct the lists.

By using lists over 2-tuples, this is also quite "unsafe", since a list can - although here probably impossible - contain no elements, one element, or more than two elements. If you make a mistake in one of the functions, it will be hard to find out that mistake.

It looks to me that it might be easier to implement a "sweep algorithm": we first calculate the sum of all the elements in the list. This is the value on the "right" in case we decide to split at that specific point, next we start moving from left to right, each time subtracting the element from the sum on the right, and adding it to the sum on the left. We can each time evaluate the difference in score, like:

import Data.List(unfoldr)

sweep :: Num a => [a] -> [(Int, a, [a])]
sweep lst = x0 : unfoldr f x0
    where x0 = (0, sum lst, lst)
          f (_, _, []) = Nothing
          f (i, r, (x: xs)) = Just (l, l)
              where l = (i+1, r-2*x, xs)

For example:

Prelude Data.List> sweep [1,4,2,5]
[(0,12,[1,4,2,5]),(1,10,[4,2,5]),(2,2,[2,5]),(3,-2,[5]),(4,-12,[])]

So if we select to split at the first split point (before the first element), the sum on the right is 12 higher than the sum on the left, if we split after the first element, the sum on the right (11) is 10 higher than the sum on the left (1).

We can then obtain the minimum of these splits with minimumBy :: (a -> a -> Ordering) -> [a] -> a:

import Data.List(minimumBy)
import Data.Ord(comparing)

findTheEquilizer :: (Ord a, Num a) => [a] -> ([a], [a])
findTheEquilizer lst = (take idx lst, tl)
    where (idx, _, tl) = minimumBy (comparing (abs . \(_, x, _) -> x)) (sweep lst)

We then obtain the correct value for [1..10]:

Prelude Data.List Data.Ord Data.List> findTheEquilizer [1..10]
([1,2,3,4,5,6,7],[8,9,10])

or for 70'000:

Prelude Data.List Data.Ord Data.List> head (snd (findTheEquilizer [1..70000]))
49498

The above is not ideal, it can be implemented more elegantly, but I leave this as an exercise.

`head . snd` is the wrong choice for the 70K case though. it is the other one. — Will Ness, Nov 07 '18 at 16:07

score 1 · Answer 2 · answered Nov 07 '18 at 17:00

Okay, firstly, let analyse why it run forever (...actually not forever, just slow), take a look of partishner function:

partishner y xs = [take y xs : [drop y xs]] ++ partishner (y - 1) xs

where take y xs and drop y xs are run linear time, i.e. O(N), and so as

[take y xs : [drop y xs]]

is O(N) too.

However, it is run again and again in recursive way over each element of given list. Now suppose the length of given list is M, each call of partishner function take O(N) times, to finish computation need:

O(1+2+...M) = (M(1+M)/2) ~ O(M^2)

Now, the list has 70k elements, it at least need 70k ^ 2 step. So why it hang.

Instead of using partishner function, you can sum the list in linear way as:

sumList::(Floating a)=>[a]->[a]
sumList xs = sum 0 xs
    where sum _ [] = []
          sum s (y:ys) = let s' = s + y in s' : sum s' ys

and findEqilizer just sum the given list from left to right (leftSum) and from right to left (rightSum) and take the result just as your original program, but the whole process just take linear time.

findEquilizer::(Ord a, Floating a) => [a] -> a
findEquilizer [] = 0 
findEquilizer xs = 
    let leftSum  = reverse $ 0:(sumList $ init xs)
        rightSum = sumList $ reverse $ xs
        afterParty = zipWith (\x y->(x-y) ** 2) leftSum rightSum
    in  fst $ minimumBy (comparing snd) (zip (reverse $ init xs) afterParty)

@WillNess Great! but sumList show the detail how it work may be more suitable for Haskell beginner — assembly.jc, Nov 08 '18 at 07:21
it's always good to re-code a library definition by yourself to see how it does its work. :) — Will Ness, Nov 08 '18 at 07:24

dfeuer · Answer 3 · 2018-11-07T20:58:55.100

1

I assume that none of the list elements are negative, and use a "tortoise and hare" approach. The hare steps through the list, adding up elements. The tortoise does the same thing, but it keeps its sum doubled and it carefully ensures that it only takes a step when that step won't put it ahead of the hare.

approxEqualSums
  :: (Num a, Ord a)
  => [a] -> (Maybe a, [a])
approxEqualSums as0 = stepHare 0 Nothing as0 0 as0
  where
    -- ht is the current best guess.
    stepHare _tortoiseSum ht tortoise _hareSum []
      = (ht, tortoise)
    stepHare tortoiseSum ht tortoise hareSum (h:hs)
      = stepTortoise tortoiseSum ht tortoise (hareSum + h) hs

    stepTortoise tortoiseSum ht [] hareSum hare
      = stepHare tortoiseSum ht [] hareSum hare
    stepTortoise tortoiseSum ht tortoise@(t:ts) hareSum hare
      | tortoiseSum' <= hareSum
      = stepTortoise tortoiseSum' (Just t) ts hareSum hare
      | otherwise
      = stepHare tortoiseSum ht tortoise hareSum hare
      where tortoiseSum' = tortoiseSum + 2*t

In use:

> approxEqualSums [1..10]
(Just 6,[7,8,9,10])

6 is the last element before going over half, and 7 is the first one after that.

edited Nov 07 '18 at 20:58

answered Nov 07 '18 at 20:36

dfeuer

48,079
5
63
167

the result for `[1..10]` is incorrect, though for e.g. `[1..12]` it is correct. also, the results for lists like `replicate 100 1` are meaningless. – Will Ness Nov 08 '18 at 02:39
@WillNess, what do you mean by "incorrect"? And why do you call those results "meaningless"? – dfeuer Nov 08 '18 at 03:08
@WillNess, note that my version produces the last "at most half" one and also a list whose first element is the first "over half" one. So you take your pick or average them or whatever you like. – dfeuer Nov 08 '18 at 03:28
@WillNess, `sum [1..7] = 28` and `sum [8..10] = 27`. So `7` is one too many. My interpretation of the question is that the ultimate goal was to find the number that crosses, and that splitting the list was just the OP's approach to doing so (i.e., something of an XY problem). Of course, my solution can easily be adapted to produce the split index as well, allowing the user to recover the first list with `take` (the second list comes for free already). – dfeuer Nov 08 '18 at 17:05
no, not one too many. the OP wants the minimal difference between the sums. -1 is better than 13. they even run it through `((x - y) ** 2)` and do a `minimumBy (comparing snd)` on it. – Will Ness Nov 08 '18 at 17:07
@WillNess, ah, I wasn't too clear on exactly what they meant. But I think the real point is that the function I offer lets them choose whichever they like quite easily in post-processing. I can clarify. – dfeuer Nov 08 '18 at 17:08

Redu · Answer 4 · 2018-11-07T22:45:17.953

I asked in the comment and OP says [1..n] is not really defining the question. Yes i guess what's asked is like [1 -> n] in random ascending sequence such as [1,3,7,19,37,...,1453,...,n].

Yet..! Even as per the given answers, for a list like [1..n] we really don't need to do any list operation at all.

The sum of [1..n] is n*(n+1)/2.
Which means we need to find m for n*(n+1)/4
Which means m(m+1)/2 = n*(n+1)/4.
So if n == 100 then m^2 + m - 5050 = 0

All we need is formula where a = 1, b = 1 and c = -5050 yielding the reasonable root to be 70.565 ⇒ 71 (rounded). Lets check. 71*72/2 = 2556 and 5050-2556 = 2494 which says 2556 - 2494 = 62 minimal difference (<71). Yes we must split at 71. So just do like result = [[1..71],[72..100]] over..!

But when it comes to not subsequent ascending, that's a different animal. It has to be done by first finding the sum and then like binary search by jumping halfway the list and comparing the sums to decide whether to jump halfway back or forward accordingly. I will implement that one later.

Will Ness · Answer 5 · 2018-11-11T09:07:13.197

Here's a code which is empirically behaving better than linear, and gets to the 2,000,000 in just over 1 second even when interpreted:

g :: (Ord c, Num c) => [c] -> [(Int, c)]
g = head . dropWhile ((> 0) . snd . last) . map (take 2) . tails . zip [1..]
         . (\xs -> zipWith (-) (map (last xs -) xs) xs) . scanl1 (+) 

g [1..10]      ==> [(6,13),(7,-1)]                        -- 0.0s
g [1..70000]   ==> [(49497,32494),(49498,-66502)]         -- 0.09s
g [70000,70000-1..1] ==> [(20502,66502),(20503,-32494)]   -- 0.09s
g [1..100000]  ==> [(70710,75190),(70711,-66232)]         -- 0.11s
g [1..1000000] ==> [(707106,897658),(707107,-516556)]     -- 0.62s
g [1..2000000] ==> [(1414213,1176418),(1414214,-1652010)] -- 1.14s  n^0.88
g [1..3000000] ==> [(2121320,836280),(2121321,-3406362)]  -- 1.65s  n^0.91

It works by running the partial sums with scanl1 (+) and taking the total sum as its last, so that for each partial sum, subtracting it from the total gives us the sum of the second part of the split.

The algorithm assumes all the numbers in the input list are strictly positive, so the partial sums list is monotonically increasing. Nothing else is assumed about the numbers.

The value must be chosen from the pair (the g's result) so that its second component's absolute value is the smaller between the two.

This is achieved by minimumBy (comparing (abs . snd)) . g.

clarifications: There's some confusion about "complexity" in the comments below, yet the answer says nothing at all about complexity but uses a specific empirical measurement. You can't argue with empirical data (unless you misinterpret its meaning).

The answer does not claim it "is better than linear", it says "it behaves better than linear" [in the tested range of problem sizes], which the empirical data incontrovertibly show.

Finally, an appeal to authority. Robert Sedgewick is an authority on algorithms. Take it up with him.

(and of course the algorithm handles unordered data as well as it does ordered).

As for the reasons for OP's code inefficiency: map sum . inits can't help being quadratic, but the equivalent scanl (+) 0 is linear. The radical improvement comes about from a lot of redundant calculations in the former being avoided in the latter. (Another example of this can be seen here.)

Don't you need at least three points to reach that? You've likely just captured some fixed cost in the interpreter (a constant term) or even a meaningless artifact (GC, etc.) — dfeuer, Nov 07 '18 at 15:59
You are also only testing lists of the form `[1..n]`, which the OP has clarified is not always the case; the input is just ordered. — chepner, Nov 07 '18 at 16:16
The fact that you are using `last` means you are scanning the entire input. All you are demonstrating is that there's a fixed amount of overhead that becomes decreasingly relevant as the list gets longer. — chepner, Nov 07 '18 at 16:36
It's not "better than linear", it *is* linear; the exponent is going up because you have a cost function like `c(n) + x`, and you are seeing the decreasing contribution of `x` as `n` increases. — chepner, Nov 07 '18 at 16:53
that's... how you measure *empirical* orders of growth.(\*) all the above comments are true, and all do not address what the answer is saying. yes there's a diminishing contribution from a constant factor, and yes, it is seen as a *sub*-linear *empirical* orders of growth with increasing (towards 1.0) power coefficient. (\*) of course one could get fancy and perform a linear regression discovering said constant factor intercept and the exact linear power law coefficient, but simple is simple enough, and is enough. — Will Ness, Oct 04 '20 at 15:18

Haskell - split a list into two sublists with closest sums

5 Answers5