I have two versions of a Haskell program that counts the occurrences of each word in a .txt file.
The first one is:
import Data.HashMap.Strict (empty, insertWith, toList)
import Data.Text (pack, toLower, filter, words)
import Data.Char (isAlphaNum)
import System.IO

-- Lazy-IO version: read the whole file as a String, pack it to Text,
-- split it into words, drop non-alphanumeric characters, lowercase,
-- and fold everything into a HashMap of counts.
wordcount :: IO ()
wordcount = withFile "input.txt" ReadMode $ \handle -> do
  content <- hGetContents handle
  print $ toList $ foldr
    (\x v -> insertWith (+) x 1 v)
    empty
    (fmap Data.Text.toLower
      $ fmap (Data.Text.filter isAlphaNum)
      $ (Data.Text.words . pack) content)
The second one uses the Conduit library:
import Data.HashMap.Strict (empty, insertWith, toList)
import Data.Char (isAlphaNum, toLower)
import Conduit
import qualified Data.Conduit.Combinators as CC

-- Streaming version: the same pipeline expressed as a Conduit.
wordcountC :: IO ()
wordcountC = do
  hashMap <- runConduitRes $ sourceFile "input.txt"
    .| decodeUtf8C                              -- ByteString chunks -> Text chunks
    .| omapCE Data.Char.toLower                 -- lowercase every character
    .| CC.splitOnUnboundedE (not . isAlphaNum)  -- split the stream into words
    .| foldMC insertInHashMap empty             -- count the words in a HashMap
  print (toList hashMap)
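Here insertInHashMap is the fold step used by foldMC above; it just bumps the count of the current word and is essentially a one-liner along these lines (accumulator first, then the current word):

-- fold step for foldMC: insert/increment the current word in the accumulator
insertInHashMap hashMap word = return (insertWith (+) word 1 hashMap)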
Running each function several times with a large input file (approx. 80 MB), I measured no difference between the execution times of the two versions: roughly 13 seconds for both the "standard" version and the Conduit one. Shouldn't the second version benefit from Conduit's stream-processing model and therefore run in less time?
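For reference, each run is timed with nothing more sophisticated than a wall-clock wrapper along these lines (simplified sketch, using the time package):

import Data.Time.Clock (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
  start <- getCurrentTime
  wordcount                        -- or wordcountC
  end <- getCurrentTime
  print (diffUTCTime end start)    -- elapsed wall-clock time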