Whenever I consider learning a new language -- Haskell in this case -- I try to hack together a primitive grep clone to see how good the language implementation and/or its libraries are at text processing, because that's a major use case for me.
Inspired by code on the Haskell wiki, I came up with the following naive attempt:
{-# LANGUAGE FlexibleContexts, ExistentialQuantification #-}

import Text.Regex.PCRE
import System.Environment

-- Lift a line-based transformation into a stdin-to-stdout filter.
io :: ([String] -> [String]) -> IO ()
io f = interact (unlines . f . lines)

-- True if the line matches the regex.
regexBool :: forall r l .
             (RegexMaker Regex CompOption ExecOption r,
              RegexLike Regex l) =>
             r -> l -> Bool
regexBool r l = l =~ r :: Bool

-- Keep only the lines that match.
grep :: forall r l .
        (RegexMaker Regex CompOption ExecOption r, RegexLike Regex l) =>
        r -> [l] -> [l]
grep r = filter (regexBool r)

main :: IO ()
main = do
    argv <- getArgs
    io $ grep $ argv !! 0
This appears to be doing what I want it to, but unfortunately it's really slow -- about 10 times slower than a Python script doing the same thing. I assume it's not the regex library that's at fault here, because it calls into PCRE, which should be plenty fast (switching to Text.Regex.Posix slows things down quite a bit further). So it must be the String implementation, which from what I've read is instructive from a theoretical point of view but inefficient in practice.
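To test that hunch, the change I have in mind is swapping only the string type and leaving the pipeline alone, roughly like the following sketch (untested; it assumes regex-pcre's =~ has instances for strict ByteStrings, and Data.ByteString.Char8 of course sidesteps the Unicode question entirely):

import qualified Data.ByteString.Char8 as B
import Text.Regex.PCRE
import System.Environment (getArgs)

main :: IO ()
main = do
    argv <- getArgs
    let pat = B.pack (head argv)
    -- same pipeline as before, but over strict bytes instead of [Char]
    B.interact (B.unlines . filter (\l -> l =~ pat :: Bool) . B.lines)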
Is there an alternative to Strings in Haskell that's both efficient and convenient (i.e. there's little or no friction when switching to it instead of Strings), and that fully and correctly handles UTF-8-encoded Unicode, as well as other encodings without too much hassle if possible? Something that everybody uses when doing text processing in Haskell but that I just don't know about because I'm a complete beginner?
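For the record, the shape I'm hoping for is something like the Data.Text version below: same pipeline, different string functions. This is only a sketch to illustrate what I mean by "little or no friction"; I don't know which regex library (if any) works with Text, so the fixed-string match below is just a hypothetical placeholder for the regex predicate:

import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Environment (getArgs)

main :: IO ()
main = do
    argv <- getArgs
    let needle = T.pack (head argv)
    -- placeholder predicate: plain substring match instead of a regex
    TIO.interact (T.unlines . filter (needle `T.isInfixOf`) . T.lines)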