Process large lists in Haskell into a single value

Question

I have 2 lists of Int of equal size (roughly 10,000 elements): say x and y. I need to compute a product of the following expression for each corresponding pair of elements from the lists: x_i/(x_i+y_i), i.e. x_i and y_i are the first elements of the lists, then second, etc.

My approaches work fine on small test cases, but ghci hangs for the larger lists. Any insight as to the cause and solution would be appreciated.

I tried to do this with fold, zipping the lists first:

getP:: [Int] -> [Int] -> Double
getP zippedCounts = foldr (\(x,y) acc -> let intX = fromIntegral x
                                             intY = fromIntegral y
                                             intSum = intX + intY in
                                             (acc*(intX/intSum))) 
                          1.0
                          zippedCounts

I also tried recusion:

getP lst [] = 1.0
getP [] lst = 1.0
getP (h1:t1) (h2:t2) = ((fromIntegral h1) / ((fromIntegral h1) + (fromIntegral h2))) * (getP t1 t2)

As well as list comprehension:

getP lst1 lst2 = (product [((fromIntegral x) / ((fromIntegral x) + (fromIntegral y)))|x <- lst1, y <- lst2])

Have you tried with the strict version, `foldr'`, from `Data.Foldable`? — Mark Seemann, Oct 14 '18 at 09:29
The list comprehension does not do what you intend, it runs for every combination (multiplying N^2 factors instead of N). Changing the generator to `|(x,y) <- zip lst1 lst2]` yields almost instantaneous results for lists with 100k elements when i try it in ghci. How do you create the lists you are using? — Jonas Duregård, Oct 14 '18 at 10:40
@JonasDuregård The lists represent counts of word occurrences in a collection of texts. The texts are read in a do block, and processed through several functions until we get their vectorized representation. Counts are then computed from the collection of vectorized texts. I did try to use strict foldr and marking lists as strict with !, but to no avail. — A B, Oct 14 '18 at 11:10
If you run `getP [100000..] [1..100000]` or such, do you get a result? Perhaps it's the production of the input lists that fail? Try printing them or writing them to a file to make sure that is not the problem. — Jonas Duregård, Oct 14 '18 at 19:41
@JonasDuregård I do get almost instantaneous output for getP if I use in-memory data. Will need to figure out how to tame lazy evaluation. — A B, Oct 14 '18 at 20:10
@AndyPants Not exactly a beginner technique, but in my opinion the "proper" memory-efficient way of solving the problem starting from two files in disk would be to use a streaming library (I like "streaming") to create two effectful producers of values, zip them http://hackage.haskell.org/package/streaming-0.2.2.0/docs/Streaming-Prelude.html#v:zip into a single producer of tuples, and consume the combined stream with an Applicative fold library like "foldl" https://stackoverflow.com/a/50050772/1364288 The "foldl" library provides a generalization of the concept of strict left fold. — danidiaz, Oct 14 '18 at 20:57

danidiaz · Accepted Answer · 2018-10-14T14:36:02.867

All three solutions have space leaks, maybe that's what causes the unresponsiveness.

In Haskell, when reducing a big list to a single summary value, it is very easy to inadvertently cause space leaks if we never "look into" the intermediate values of the computation. We can end up with a gigantic tree of unevaluated thunks hiding behind a seemingly inoffensive single Double value.

The foldr example leaks because foldr never forces its accumulator to weak head normal form. Use the strict left foldl' instead (you will need to reorder some function arguments). foldl' should ensure that the intermediate Double values remain "small" and thunks don't accumulate.

The explicit recursion example is dangerous because it is not tail-recursive and for large lists can cause a stack overflow (we repeatedly put values in the stack, waiting for the next recursive call to complete). A solution would involve making the function tail-recursive by passing the intermediate result as an extra parameter, and putting a bang pattern on that parameter to ensure thunks don't accumulate.

The product example leaks because, unfortunately, neither the sum nor the product functions are strict. For large lists, it's better to use foldl' instead. (There's also a bug, as it has been mentioned in the comments.)

dopamane · Answer 2 · 2018-10-14T07:58:09.260

0

You could try zipWith followed by product:

getP :: [Int] -> [Int] -> Double
getP xs ys = product $ zipWith func xs ys
  where 
    func x y = let dx = fromIntegral x
                   dy = fromIntegral y
               in dx / (dx + dy)

I would avoid using explicit recursion and stick to library functions for speed. You could also use certain ghc flags to speed up the compiled code.

edited Oct 14 '18 at 07:58

answered Oct 14 '18 at 07:38

dopamane

1,315
2
14
27

Process large lists in Haskell into a single value

2 Answers2