The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.
What is the best practice when one needs to store and compare large numbers of small strings in Haskell?
Below I have tried to condense a particular problematic case into a small example:
import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State
main = putStr
. unlines . map show . flip evalState (0,Map.empty)
. mapM toInt
. S.words
=<<
S.getContents
toInt x = do
let x' =
U.fromList . Strict.unpack . -- Comment this line to increase memory usage
Serialize.encode $ x
(i,t) <- get
case Map.lookup x' t of
Just j -> return j
Nothing -> do
let i' = i + (1::Int)
put (i', Map.insert x' i t)
return i
When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.