6

Whenever I consider learning a new language -- haskell in this case -- I try to hack together a primitive grep clone to see how good the language implementation and/or its libraries are at text processing, because that's a major use case for me.

Inspired by code on the haskell wiki, I came up with the following naive attempt:

{-# LANGUAGE FlexibleContexts, ExistentialQuantification #-}

import Text.Regex.PCRE
import System.Environment

io :: ([String] -> [String]) -> IO ()
io f = interact (unlines . f . lines)

regexBool :: forall r l .
  (RegexMaker Regex CompOption ExecOption r,
   RegexLike Regex l) =>
  r -> l -> Bool
regexBool r l = l =~ r :: Bool

grep :: forall r l .
  (RegexMaker Regex CompOption ExecOption r, RegexLike Regex l) =>
  r -> [l] -> [l]
grep r = filter (regexBool r)

main :: IO ()
main = do
  argv <- getArgs
  io $ grep $ argv !! 0

This appears to be doing what I want it to, but unfortunately, it's really slow -- about 10 times slower than a python script doing the same thing. I assume it's not the regex library that's at fault here, because it's calling into PCRE which should be plenty fast (switching to Text.Regex.Posix slows things down quite a bit further). So it must be the String implementation, which is instructive from a theoretical point of view but inefficient according to what I've read.

Is there an alternative to Strings in haskell that's both efficient and convenient (i.e. there's little or no friction when switching to using that instead of Strings) and that fully and correctly handles UTF-8-encoded Unicode, as well as other encodings without too much hassle if possible? Something that everybody uses when doing text processing in haskell but that I just don't know about because I'm a complete beginner?

dlukes
  • 1,313
  • 16
  • 27
  • 7
    Use [Text](https://hackage.haskell.org/package/text-1.2.2.1/docs/Data-Text.html) – ErikR Jul 30 '16 at 15:56
  • 2
    Just wanted to point out that getting C-like speeds is possible, but it might take some effort. Have a look at __cgrep__ - http://awgn.github.io/cgrep/ – ErikR Jul 30 '16 at 16:27
  • 4
    `String` is a low-performance, lazy string, which is "fine" for basic short strings but unsuitable for serious text manipulation. `Text` is the high-performance type for Unicode text manipulation. (There's also `ByteString` which is _not_ for text but for byte sequences.) – chi Jul 30 '16 at 16:27
  • 2
    Are you compiling with optimizations? – amalloy Jul 30 '16 at 16:32
  • @amalloy: I wasn't aware of optimizations, thanks for pointing them out to me! Unfortunately, in this particular case, the difference seems to be negligible... – dlukes Jul 30 '16 at 20:11
  • @ErikR, @chi: Thanks for the breakdown and the pointers, cgrep looks well worth investigating! Would either of you perhaps take a shot at modifying my example code to work with `Text`? I'd be happy to accept that as an answer. I'm trying to do so myself but I can't get my program to typecheck, so I must be missing something :( – dlukes Jul 30 '16 at 20:16
  • i think you should try this at https://codereview.stackexchange.com/ – epsilonhalbe Jul 31 '16 at 12:24

1 Answers1

1

It's possible that the slow speed is caused by using the standard library's list type. I've often run into performance problems with it in the past.

It would be a good idea to profile your executable, to see where it spends its time: Tools for analyzing performance of a Haskell program. Profiling Haskell programs is really easy (compile with a switch and execute your program with an added argument, and the report is written to a text file in the current working directory).

As a side note, I use exactly the same approach as you when learning a new language: create something that works. My experience doing this with Haskell is that I can easily gain an order of magnitude or two in performance by profiling and making relatively simple changes (usually a couple of lines).

Community
  • 1
  • 1
runeks
  • 1,745
  • 2
  • 17
  • 26
  • Thanks for the tip, I didn't know haskell had such nice profiling support out of the box! Looks like I'm spending about two thirds of the time in `io` and one third in `regexBool` (see above), so no real surprises here -- I just need to make those calls faster... Which should by doable by using `Text` instead of `String` as pointed out by other commenters above; only trouble is, I haven't been able to get my program to typecheck with this modification so far. – dlukes Jul 31 '16 at 13:08
  • @dlukes - depends on which regex-library you are using - I think regex-heavy has `Text` support the others seem to be `String`/`ByteString` only but maybe with inlining this is good enough for your needs. – epsilonhalbe Jul 31 '16 at 13:21