23

The commonly recommended Haskell string types seem to be ByteString or Text. I often work with a very large number of short (English word sized) strings, and typically need to store them in a lookup table such as Data.Map. In many cases I find that in this scenario, a table of Strings can take up less memory then a table of ByteStrings. Unboxed Data.Vectors of Word8 are also (much) more compact than ByteStrings.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

Below I have tried to condense a particular problematic case into a small example:

import qualified Data.ByteString.Lazy.Char8 as S
import qualified Data.ByteString as Strict
import qualified Data.Map as Map
import qualified Data.Vector.Unboxed as U
import qualified Data.Serialize as Serialize
import Control.Monad.State

main =   putStr 
  . unlines . map show . flip evalState (0,Map.empty) 
  . mapM toInt 
  . S.words
  =<<
  S.getContents


toInt x = do  
  let x' =   
          U.fromList . Strict.unpack .  -- Comment this line to increase memory usage
           Serialize.encode $ x  
  (i,t) <- get
  case Map.lookup x' t of
    Just j -> return j
    Nothing -> do 
      let i' = i + (1::Int)
      put (i', Map.insert x' i t)
      return i

When I run this on a file containing around 400.000 words of English text, the version with strict bytestring keys uses around 50MB memory, the one with Word8 vectors uses 6MB.

Grzegorz Chrupała
  • 3,053
  • 17
  • 24
  • What are you using those strings for ? Is it some kind of dictionary ? – ARRG Feb 22 '12 at 16:31
  • 6
    Can you give some code example where ByteStrings take up more memory than Strings or "much more" memory than Word8 vectors? I can't figure why that would be the case unless you are doing something strange. – shang Feb 22 '12 at 16:40
  • 7
    @shang: I can imagine this happening if you're making the mistake of comparing the size of a map full of strict ByteStrings to a map containing string thunks. Though more details would be helpful. A short test program demonstrating the issue would be especially nice. – hammar Feb 22 '12 at 16:43
  • 1
    @hammar: Yeah, that's one option. Another might be that you are slicing the words from a large ByteString and retaining references to it. – shang Feb 22 '12 at 17:06
  • 4
    You might want to have a look at [Data.Trie](http://hackage.haskell.org/packages/archive/bytestring-trie/0.1.4/doc/html/Data-Trie.html). – danr Feb 22 '12 at 17:10
  • Any chance you are processing the pieces of strings before stuffing them in a Map? When using ByteString, the GHC runtime will maintain originally read bodies of text along with pointers even if you processed them down to single words. This would really bloat the memory if you're processing, say, a series of paragraphs into a few words each. All the paragraphs in their entirety would be retained in memory! – ozataman Feb 22 '12 at 23:30
  • @ozataman: if this is the case, what is the recommended solution? It would seem like a rather common use case for Text/ByteString. – Grzegorz Chrupała Feb 23 '12 at 10:54
  • @Grzegorz: `Data.ByteString.copy` will copy into a new bytestring, so if the old memory is now unreferenced it can be GC'd. – Ben Millwood Mar 01 '12 at 14:44
  • 2
    Also see short bytestrings https://hackage.haskell.org/package/bytestring-0.10.4.0/docs/Data-ByteString-Short.html – ron Mar 26 '15 at 09:09

3 Answers3

5

In the absence of other answers, I'm going to go out on a limb here.

What is the best practice when one needs to store and compare large numbers of small strings in Haskell?

If the small strings are meant to be human readable (e.g. an English word) then use Text. If they are meant to be read only by the computer, use ByteString. The decision to use strict or lazy variants of these depends on how you build and use these small strings.

You shouldn't need to use your own unboxed Vectors of Word8. If you experiencing a specific situation where regular String is faster than Text or ByteString, then throw the details up on StackOverflow and we'll try to figure out why. If you perform detailed analysis and can prove that an unboxed Vector of Word8 consistently works significantly better than Text or ByteString, then start conversations on mailing lists, irc, reddit, etc; the standard libraries are not set in stone, and improvements are always welcome.

But I think it highly likely that you are just doing something weird, as hammar and shang suggest.

P.S. for your particular use case, instead of storing a lot of small strings, you should consider a more appropriate data structure catered to your needs, e.g. a Trie as danr suggests.

Dan Burton
  • 53,238
  • 27
  • 117
  • 198
  • 4
    Sorting _short_ strings is one place where a regular `String` performs better than `ByteString` (I don't know about `Text`, but I wouldn't be surprised if `String` beats that too for this task). Why that is is obvious: `ByteString` uses a counting sort. – Daniel Fischer Feb 22 '12 at 23:16
3

I know this is a 6-year old post, but I was wondering the same recently, and found this useful blog post: https://markkarpov.com/post/short-bs-and-text.html. It seems that yes, this is a recognized issue, and Short(Text/ByteString) are the solution.

iustin
  • 343
  • 1
  • 11
3

A (strict) ByteSting is a constructor over an unboxed ForiegnPtr to a Word8 and two unboxed Ints.

A ForeignPtr is another constructor over an Addr# (a GHC prim) and a ForeignPtrContents:

data ForeignPtrContents
  = PlainForeignPtr !(IORef (Finalizers, [IO ()]))
  | MallocPtr      (MutableByteArray# RealWorld) !(IORef (Finalizers, [IO ()]))
  | PlainPtr       (MutableByteArray# RealWorld)

...

For short strings, ByteStrings simply pack too much administration to benefit their contiguous representation of the actual "string" data.

For the original question - I'd check an average word length of your corpus, but I can't see ByteString being more efficient than String aka [Char] which uses 12 bytes per Char (source the original ByteString paper).

A general plea to Haskellers (not aimed the poster of the original question) - please stop bashing String aka [Char] - having both String and Text (and ByteString when you really need bytes) makes sense. Or use Clean where the contiguous String representation is better suited to short strings.

Caveat - I may have been looking at an old version of the ByteString internals with regards to what data types it uses internally.

stephen tetley
  • 4,465
  • 16
  • 18