subsequences of length n from list performance

Question

I implemented a version of this answer https://stackoverflow.com/a/9920425/1261166 (I don't know what was intended by the person answering)

sublistofsize 0 _        = [[]]
sublistofsize _ []       = []
sublistofsize n (x : xs) = sublistsThatStartWithX ++ sublistsThatDontStartWithX
  where sublistsThatStartWithX = map (x:) $ sublistofsize (n-1) xs
        sublistsThatDontStartWithX = sublistofsize n xs

what I'm unsure of is sublistsThatStartWithX = map (x:) $ sublistofsize (n-1) xs

I assume that map (x:) gives a problem performance wise, but not sure of how to solve it. I've done profiling on print $ length $ sublistofsize 5 $ primesToTakeFrom 50

COST CENTRE                                  MODULE                                        no.     entries  %time %alloc   %time %alloc
sublistofsize                             Main                                          112     4739871   46.9   39.9    96.9  100.0
 sublistofsize.sublistsThatDontStartWithX Main                                          124     2369935    2.2    0.0     2.2    0.0
 sublistofsize.sublistsThatStartWithX     Main                                          116     2369935   47.8   60.1    47.8   60.1

Did I implement it in a good way? Are there any faster ways of doing it?

Have you measured a performance problem? This problem is fundamentally linear in the size of the output and that `map` won't change that. — Ganesh Sittampalam, Jan 21 '14 at 17:49
My thinking was that map (x:) makes the x hang and wait for the return values of the recursive call, or maybe I'm wrong...? — Viktor Mellgren, Jan 21 '14 at 18:33
It doesn't, because Haskell is lazy, but even if it did, why would it matter? The work has to be done sometime. — Ganesh Sittampalam, Jan 21 '14 at 18:37
Since I'm not very good with haskell and looking for performance problems, my guess was that that would be the place where the problem was, maybe something with tail recursion, I don't know. I have made another function that is faster that uses list comprehension, but my guess would be that this would be faster since I do alot of other things such as guards and I have no bounds on the primes in the other version (it checks all(!) combinations) — Viktor Mellgren, Jan 21 '14 at 18:50
I think you need to make it clearer what your question is actually about - e.g. is it about why is there a performance difference to your other code (and if so give that other code and details of the measurements), is there a faster way of writing the above code, or what? — Ganesh Sittampalam, Jan 21 '14 at 18:54
@GaneshSittampalam: Yes, are there any problems with the way I wrote it? And is there a faster way? — Viktor Mellgren, Jan 21 '14 at 18:56

Bergi · Accepted Answer · 2014-01-22T17:18:10.277

I assume that map (x:) gives a problem performance wise

No. map is coded efficiently and runs in linear time, no problems here.

However, your recursion might be a problem. You're both calling sublistofsize (n-1) xs and sublistofsize n xs, which - given a start list sublistofsize m (_:_:ys) - does evaluate the term sublistofsize (m-1) ys twice, as there is no sharing between them in the different recursive steps.

So I'd apply dynamic programming to get

subsequencesOfSize :: Int -> [a] -> [[a]]
subsequencesOfSize n xs = let l = length xs
                          in if n>l then [] else subsequencesBySize xs !! (l-n)
 where
   subsequencesBySize [] = [[[]]]
   subsequencesBySize (x:xs) = let next = subsequencesBySize xs
                             in zipWith (++) ([]:next) (map (map (x:)) next ++ [[]])

Not that appending the empty lists is the most beautiful solution, but you can see how I have used zipWith with the displaced lists so that the results from next are used twice - once directly in the list of subsequences of length n and once in the list of subsequences of length n+1.

Testing it in GHCI with :set +s, you can see how this is drastically faster than the naive solutions:

*Main> length $ subsequencesOfSize 7 [1..25]
480700
(0.25 secs, 74132648 bytes)
(0.28 secs, 73524928 bytes)
(0.30 secs, 73529004 bytes)
*Main> length $ sublistofsize 7 [1..25] -- @Vixen (question)
480700
(3.03 secs, 470779436 bytes)
(3.35 secs, 470602932 bytes)
(3.14 secs, 470747656 bytes)
*Main> length $ sublistofsize' 7 [1..25] -- @Ganesh
480700
(2.00 secs, 193610388 bytes)
(2.00 secs, 193681472 bytes)
*Main> length $ subseq 7 [1..25] -- @user5402
480700
(3.07 secs, 485941092 bytes)
(3.07 secs, 486279608 bytes)

score 2 · Answer 2 · answered Jan 21 '14 at 19:18

Your implementation is the natural "Haskell-ish" one for that problem.

If you end up using the entire result, then there won't be anything asymptotically faster for this problem given the output datastructure ([[a]]) because it runs in time linear in the length of the output.

The use of map (x:) is a very natural way to add an element onto the start of each list and there's unlikely to be any significantly faster options given that we are working with lists.

In principle the repeated use of (++) is inefficient as it causes the left-hand argument to be traversed each time it is called, but the total cost in this case should only be an extra constant factor.

You might be able to improve it with the use of an accumulating parameter otherResults to collect the results, but to make this change you also need to pass down prefix in reversed order and re-reverse it at the end, which could well eat up the savings:

sublistofsize' 0 _        prefix otherResults = reverse prefix : otherResults
sublistofsize' _ []       prefix otherResults = otherResults
sublistofsize' n (x : xs) prefix otherResults =
   sublistofsize' (n-1) xs (x:prefix) (sublistofsize' n xs prefix otherResults)

sublistofsize n xs = sublistofsize' n xs [] []

Thank you, the one with the accumulator is about 2 times faster (I changed the code to not reverse prefix, since i don't need it in order — Viktor Mellgren, Jan 21 '14 at 20:09
Actually, just occurred to me that it'd be more efficient still to reverse the input list first instead of reversing each output list, at least if you don't care about the specific order of results. — Ganesh Sittampalam, Jan 21 '14 at 20:10

score 2 · Answer 3 · answered Jan 21 '14 at 20:21

2

An optimization which should help is to keep track of whether there are enough elements in the list to form the rest of the subsequence. This can be done very efficiently by keeping track of a pointer which is n-1-elements ahead of xs and advancing them both as you recurse.

An implementation:

  nthtail 0 xs = xs
  nthtail _ [] = []
  nthtail n (x:xs) = nthtail (n-1) xs

  subseq 0 _ = [[]]
  subseq n xs =
    if null t
      then []
      else go n xs t
    where
      t = nthtail (n-1) xs  -- n should always be >= 1 here
      go 0 _ _  =  [[]]
      go _ _ [] = []
      go n xs@(x:xt) t = withx ++ withoutx
        where withx = map (x:) $ go (n-1) xt t
              withoutx = go n xt (tail t)

answered Jan 21 '14 at 20:21

ErikR

51,541
9
73
124

1

`nthtail = drop`? :-) – Bergi Jan 22 '14 at 15:07
@Bergi: The only difference is that nthtail takes any n that is (Num a), (which is not correct if supplied with a float. – Viktor Mellgren Jan 23 '14 at 18:47
supports Integer (just as genericDrop) and also matches x in (x:xs), could be (_:xs) – Viktor Mellgren Jan 23 '14 at 18:56

score 1 · Answer 4 · answered Jan 27 '20 at 13:57

This is a 6 years old topic but i believe i have a code worth sharing here.

The accepted answer by @Bergi is just super but still i think this job can be done better as seen from two aspects;

Although not mentioned in any of the specifications, it returns combinations in reverse lexicographical order. One might like to have them in lexicographical order as it is mostly the case.
When tested with C(n,n/2) they perform similar however when tested like C(100,5) the following code is much faster and more memory efficient.

.

combinationsOf :: Int -> [a] -> [[a]]
combinationsOf 1 as        = map pure as
combinationsOf k as@(x:xs) = run (l-1) (k-1) as $ combinationsOf (k-1) xs
                             where
                             l = length as

                             run :: Int -> Int -> [a] -> [[a]] -> [[a]]
                             run n k ys cs | n == k    = map (ys ++) cs
                                           | otherwise = map (q:) cs ++ run (n-1) k qs (drop dc cs)
                                           where
                                           (q:qs) = take (n-k+1) ys
                                           dc     = product [(n-k+1)..(n-1)] `div` product [1..(k-1)]

Lets compare them against the test case under the accepted answer.

*Main> length $ subsequencesOfSize 7 [1..25]
480700
(0.27 secs, 145,572,672 bytes)

*Main> length $ combinationsOf 7 [1..25]
480700
(0.14 secs, 95,055,360 bytes)

Let us test them against something harder like C(100,5)

*Main> length $ subsequencesOfSize 5 [1..100]
75287520
(52.01 secs, 77,942,823,360 bytes)

*Main> length $ combinationsOf 5 [1..100]
75287520
(17.61 secs, 11,406,834,912 bytes)

subsequences of length n from list performance

4 Answers4

Linked