6

My problem is that when using any of the Map implementations in Haskell that I always get a "Stack space overflow" when working with a million values.

What I'm trying to do is process a list of pairs. Each pair contains two Ints (not Integers, I failed miserably with them so I tried Ints instead). I want to go through each pair in the list and use the first Int as a key. For each unique key I want to build up a list of second elements where each of the second elements are in a pair that have the same first element. So what I want at the end is a "Map" from an Int to a list of Ints. Here's an example.

Given a list of pairs like this:

[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]

I would like to end up with a "Map" like this:

{1 : [42,79,10], 2 : [11], 3 : [18,99]}

(I'm using a Python-like notation above to illustrate a "Map". I know it ain't Haskell. It's just there for illustrative purposes.)

So the first thing I tried was my own hand built version where I sorted the list of pairs of Ints and then went through the list building up a new list of pairs but this time the second element was a list. The first element is the key i.e. the unique Int values of the first element of each pair and the second element is a list of the second values of each original pair which have the key as the first element.

So given a list of pairs like this:

[(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)]

I end up with a list of pairs like this:

[(1, [42,79,10], (2, [11]), (3, [18,99])]

This is easy to do. But there is one problem. The performance of the "sort" function on the original list (of 10 million pairs) is shockingly bad. I can generate the original list of pairs in less than a second. I can process the sorted list into my hand built map in less than a second. However, sorting the original list of pairs takes 40 seconds.

So I thought about using one of the built-in "Map" data structures in Haskell to do the job. The idea being I build my original list of pairs and then using standard Map functions to build a standard Map.

And that's where it all went pear-shaped. It works well on a list of 100,000 values but when I move to 1 million values, I get a "Stack space overflow" error.

So here's some example code that suffers from the problem. Please, please note that is not the actual code that I want to implement. It is just a very simplified version of code for which the same problem exists. I don't really want to separate a million consecutive numbers into odd and even partitions!!

import Data.IntMap.Strict(IntMap, empty, findWithDefault, insert, size)

power = 6

ns :: [Int]
ns = [1..10^power-1]

mod2 n = if odd n then 1 else 0

mod2Pairs = zip (map mod2 ns) ns

-- Takes a list of pairs and returns a Map where the key is the unique Int values
-- of the first element of each pair and the value is a list of the second values
-- of each pair which have the key as the first element.
-- e.g. makeMap [(1,10),(2,11),(1,79),(3,99),(1,42),(3,18)] = 
--      1 -> [42,79,10], 2 -> [11], 3 -> [18,99]
makeMap :: [(Int,a)] -> IntMap [a]
makeMap pairs = makeMap' empty pairs
  where
    makeMap' m [] = m
    makeMap' m ((a, b):cs) = makeMap' m' cs
      where
        bs = findWithDefault [] a m
        m' = insert a (b:bs) m

mod2Map = makeMap mod2Pairs

main = do
  print $ "Yowzah"
  print $ "length mod2Pairs="++ (show $ length mod2Pairs)
  print $ "size mod2Map=" ++ (show $ size mod2Map)

When I run this, I get:

"Yowzah"
"length mod2Pairs=999999"
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

From the above output, it should be clear that the stack space overflow happens when I try to do "makeMap mod2Pairs".

But to my naive eye all this seems to do is go through a list of pairs and for each pair lookup a key (the first element of each pair) and A) if it doesn't find a match return an empty list or B) if it does find a match, return the list that has previously been inserted. In either case it "cons"'s the second element of the pair to the "found" list and inserts that back into the Map with the same key.

(PS instead of findWithDefault, I've also tried lookup and handled the Just and Nothing using case but to no avail.)

I've had a look through the Haskell documentation on the various Map implementations and from the point of view of performance in terms of CPU and memory (especially stack memory), it seems that A) a strict implementation and B) one where the keys are Ints would be the best. I have also tried Data.Map and Data.Strict.Map and they also suffer from the same problem.

I am convinced the problem is with the "Map" implementation. Am I right? Why would I get a stack overflow error i.e. what is the Map implementation doing in the background that is causing a stack overflow? Is it making lots and lots of recursive calls behind the scenes?

Can anyone help explain what is going on and how to get around the problem?

MicMac
  • 285
  • 1
  • 6
  • 1
    Your code as given works without problems for me (GHC 7.10.2 or 8.2.1, both in interpreted mode and when compiled with or without optimisations), printing `size mod2Map=2`. – leftaroundabout Aug 15 '18 at 23:28
  • 4
    BTW, I appreciate the effort you put into writing the question but frankly it would be more useful to _get to the point_ a bit quicker. – leftaroundabout Aug 15 '18 at 23:28
  • I'm using "The Glorious Glasgow Haskell Compilation System, version 7.6.3" so maybe I need to upgrade. I'll let you know what happens when I do. Have you previously changed the stack space size as the compiler suggests "Use `+RTS -Ksize -RTS' to increase it."? – MicMac Aug 15 '18 at 23:33
  • PS regarding the BTW, do you mean all the stuff before "My problem is that when using any..."? If so, that study in passive aggression is just my latest attempt in trying to get a question answered! Thanks anyway. – MicMac Aug 15 '18 at 23:39
  • 7.6.3 is the standard version available on my version of Linux Mint (with apt-get). Given that I should really be doing things the right way i.e. with the fromListWith option suggested by Daniel below, I'm not going to bother with upgrading. – MicMac Aug 16 '18 at 00:18
  • 7.6.3 is rather old, having been released in April of 2013. The way the stack is handled changed in 7.8, as I recall. You should probably install version 8.4. hvr's PPA has a very good reputation, or you can install it manually if you really want. – dfeuer Aug 16 '18 at 03:21

1 Answers1

5

I don't have an old enough GHC to check (this works just fine for me, and I don't have 7.6.3 as you do), but my guess would be that your makeMap' is too lazy. Probably this will fix it:

makeMap' m ((a, b):cs) = m `seq` makeMap' m' cs

Without it, you are building up a million-deep nested thunk, and deeply-nested thunks is the traditional way to cause stack overflows in Haskell.

Alternately, I would try just replacing the whole makeMap implementation with fromListWith:

makeMap pairs = fromListWith (++) [(k, [v]) | (k, v) <- pairs]
Daniel Wagner
  • 145,880
  • 9
  • 220
  • 380
  • Go raibh míle maith agat. Both of those suggestions worked a treat. I get the fromListWith concept. Obviously someone smart already thought you might want to do more than just replace the existing value in a map e.g. you might want to add another one to the existing value. And then they optimised it. The `seq` one seems a little bit like black magic and not really in the whole purist, functional programming, Haskell spirit but, hey, it works. So I'm conflicted. Thanks again! – MicMac Aug 16 '18 at 00:11
  • 1
    @MicMac [foldl vs foldl'](https://wiki.haskell.org/Foldr_Foldl_Foldl') - a classic Haskell problem which newer GHC apparently solves better due to more advanced strictness analysis (you do realise that 7.6.3 was released in 2013?). By the way you can also achieve the same result by changing `m` to `m!` with `BangPatterns` pragma on. – Ed'ka Aug 16 '18 at 00:42
  • 1
    I had not realised that the version of GHC I am using is from 2013. apt-get on the version of linux mint I'm using insists that it's the latest version. I have just downloaded and installed 8.4.3 from haskell.org. Everything suggested above runs at the same speed as in 7.6.3 (a bit worrying?) with the only difference being that my crappy makeMap doesn't have a stack overflow anymore. It just takes twice as long as the suggestions above. So without resorting to seq or m! which seem a bit like my old C/C++ days, I'm going to go with the fromListWith. – MicMac Aug 16 '18 at 01:36
  • 2
    @Ed'ka I doubt modern GHC actually does a better job of avoiding deeply-nested thunks or a better job of strictness analysis here. I place the blame on GHC's new ability to dynamically grow its stack -- in other words, I suspect the space leak is still there in modern GHCs, but instead of manifesting as a stack overflow it manifests as silently using more memory than necessary. – Daniel Wagner Aug 16 '18 at 02:37
  • @MicMac Try this: `!bs = findWithDefault [] a m` it reduced memory use in my experiments from 300Mb to 150MB. And just try running your program with `+RTS -s` option to get quick insight of memory use. Also make sure it isn't compiled with `-threaded` option (which did increase running time by the factor of two - see if there is any `SPARKS` in your `+RTS -s` output). I doubt that newer GHC can make your program run slower :) – Ed'ka Aug 16 '18 at 02:44
  • @DanielWagner I still think that [Strictness analysis](https://wiki.haskell.org/Performance/Strictness#Strictness_analysis) matters here: try compiling the original program with `ghc -O0 -rtsopts` and then running it with `+RTS -K1m` (to limit the stack growth) - it stack overflows. Recompile it with `ghc -O2 -rtsopts` and it doesn't. Or could be something else which `-O2` adds to the picture? – Ed'ka Aug 16 '18 at 03:21
  • 2
    @MicMac: `seq` seems like black magic until it seems obvious—``a `seq` b`` just makes it explicit to the runtime that the evaluation of `b` depends on the evaluation of `a`, even though there’s no other connection between them such as a data dependency or `case` expression. – Jon Purdy Aug 16 '18 at 04:56
  • @Ed'ka Right, but I don't think this situation has *improved* since 7.6, or at least not significantly. I expect even with 7.6, turning on `-O2` would have solved the problem entirely. – Daniel Wagner Aug 16 '18 at 13:15
  • 1
    @DanielWagner You are right, I tested and it indeed does not stack overflow with `-O2` on 7.6.3. Which means @MicMac was running unoptimised version all along (not recommended). Usually this kind of questions are immediately met with *Did you compile it with optimisation turned on?* comment. Unfortunately this one wasn't :) – Ed'ka Aug 17 '18 at 00:14