Regex & String Libraries in Haskell

Question

I'm trying to introduce Haskell into my daily life by using it to write incidental scripts and such.

readProcess is handy for getting the results of external commands, but I find myself searching when it comes to processing the String results. I'm coming from Ruby, where regexes are first-class, so I'm used to having them as a tool.

Are there any libraries I should read up on to do string processing in Haskell? Searching for matching lines, pulling out matching regions of a string, and such?

You can find a great survey on the Haskell wiki: http://www.haskell.org/haskellwiki/Regular_expressions – max taldykin, Dec 10 '10 at 15:09

score 5 · Accepted Answer · edited Jan 04 '14 at 23:27

I found this to be a good starting point: http://www.serpentine.com/blog/2007/02/27/a-haskell-regular-expression-tutorial/ It only covers the basics, no advanced topics, but it's great to get started IMHO.

Things to note:

- Regexes in Haskell are unusual in that the match operators have overloaded return types, so you can pull many different kinds of thing out of a regex match (Bool, String, [String], etc.). The return type you ask for determines the kind of answer you get back: whether or not the regex matched, the text of the match, all matching subgroups, and so on. This is done using some fairly complex typeclass voodoo. The above link demonstrates the basic kinds; a more complete list is here
- There are actually multiple standard modules in Haskell that provide regex support (strange but true). The tutorial above shows the POSIX module, because it comes standard with Haskell. If you have cabal, you can also pretty easily install other regex modules and use those instead. There's a PCRE binding (regex-pcre), as well as some packages that work via DFAs (regex-dfa, among others). Install using a command like: cabal install regex-pcre and you should be good to go.
  - (The modules have a standardized interface; the difference is mainly in the implementation and the regex flavor.)
- There IS a regex object in Haskell, but you don't really need it to use the =~ or =~~ match operators. (Just use a string; the conversion happens automatically.) If your task is complicated enough that you want a first-class parsing object, consider looking into Parsec, as has been mentioned in other answers.
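
To make the overloaded return types concrete, here is a minimal sketch using Text.Regex.Posix, assuming the regex-posix package is installed. The sample input and the helper names (hasNumber, firstNumber, allNumbers, keyValue) are made up for illustration; only =~ and getAllTextMatches come from the library.

```haskell
import Text.Regex.Posix ((=~), getAllTextMatches)

-- Hypothetical sample input, e.g. one line captured via readProcess.
sample :: String
sample = "disk usage: 42% of 100GB"

-- Bool: did the pattern match at all?
hasNumber :: String -> Bool
hasNumber s = s =~ "[0-9]+"

-- String: the text of the first match ("" when nothing matches).
firstNumber :: String -> String
firstNumber s = s =~ "[0-9]+"

-- [String]: the text of every non-overlapping match.
allNumbers :: String -> [String]
allNumbers s = getAllTextMatches (s =~ "[0-9]+")

-- (before, match, after, subgroups) for the first match.
keyValue :: String -> (String, String, String, [String])
keyValue s = s =~ "([a-z]+)=([a-z]+)"

main :: IO ()
main = do
  print (hasNumber sample)      -- True
  print (firstNumber sample)    -- "42"
  print (allNumbers sample)     -- ["42","100"]
  print (keyValue "key=value")  -- ("","key=value","",["key","value"])
```

The pattern is the same in the first three helpers; only the type signature changes what comes back. Swapping the import for Text.Regex.PCRE (from regex-pcre) should leave the rest of the code as it is, since the modules share an interface.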

DISCLAIMER: I only really use pcre myself, so I don't know much about the other packages.

score 4 · Answer 2 · answered Dec 10 '10 at 22:58

When I was first teaching myself Haskell I found that learning to use a parser combinator library for string processing was a fantastic investment. They can do everything regular expressions can do, and much more, and writing combinator parsers is a great way to build up intuitions about type classes like monads, applicative functors, etc.

I tend to use Attoparsec these days, but Parsec is probably a better starting point because it's more widely documented and discussed, provides nicer error messages, etc.
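
As a rough illustration of what a combinator parser looks like, here is a small Parsec sketch, assuming the parsec package is installed. The line format ("name: number") and the names entry and parseEntry are invented for the example; they are not from any particular tool's output.

```haskell
import Text.Parsec (ParseError, char, digit, eof, letter, many1, parse, spaces)
import Text.Parsec.String (Parser)

-- Hypothetical line format for illustration: "<name>: <number>"
entry :: Parser (String, Int)
entry = do
  name <- many1 letter   -- one or more letters
  _    <- char ':'       -- a literal colon
  spaces                 -- optional whitespace
  n    <- many1 digit    -- one or more digits
  eof                    -- nothing may follow the number
  return (name, read n)

-- Run the parser on a String; "<input>" is only a label used in error messages.
parseEntry :: String -> Either ParseError (String, Int)
parseEntry = parse entry "<input>"

main :: IO ()
main = do
  print (parseEntry "widgets: 12")   -- Right ("widgets",12)
  print (parseEntry "widgets = 12")  -- Left with a position and what was expected
```

Compared with a regex, the parser returns a typed result directly, and a failed parse reports where and why it failed instead of simply not matching.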

I haven't yet reached the "Parsec" chapter, I am looking forward to it, though. :) — Marcus Borkenhagen, Dec 10 '10 at 23:06

score 3 · Answer 3 · edited Dec 10 '10 at 20:18

A good introduction to regular expressions can be found in Real World Haskell.

Update: On a side note, for command processing, pipes, and such, check out HSH.

edited Dec 10 '10 at 20:18

answered Dec 10 '10 at 15:02

Marcus Borkenhagen

score 0 · Answer 4 · answered Dec 10 '10 at 18:23

There are plenty of great regex libs in Haskell, but we have better tools. Let's stick with standard Haskell Strings for now (i.e. lists of Char). The basics are all in Data.List: http://www.haskell.org/ghc/docs/latest/html/libraries/base-4.3.0.0/Data-List.html. You have lines, unlines, words, unwords, takeWhile, dropWhile, etc., as well as isPrefixOf, isInfixOf, and so on.

You may end up writing your own recursive functions fairly directly, but that's a breeze too. The only operations that are really missing are the splitting ones, for which you can use Brent's excellent package: http://hackage.haskell.org/package/split
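
A minimal sketch of this style, using only functions from base. The helper names (matchingLines, valueFor) and the sample output are hypothetical, standing in for whatever readProcess returned.

```haskell
import Data.Char (isSpace)
import Data.List (isInfixOf, isPrefixOf)

-- Keep only the lines that contain the given substring
-- (roughly a fixed-string grep).
matchingLines :: String -> String -> [String]
matchingLines needle = filter (needle `isInfixOf`) . lines

-- Find the first line starting with "key=" and return the text after
-- the '=' up to the next whitespace character.
valueFor :: String -> String -> Maybe String
valueFor key txt =
  case [drop (length prefix) l | l <- lines txt, prefix `isPrefixOf` l] of
    (rest : _) -> Just (takeWhile (not . isSpace) rest)
    []         -> Nothing
  where
    prefix = key ++ "="

main :: IO ()
main = do
  -- Made-up command output for illustration.
  let output = "user=alice shell\nhome=/home/alice\nuid=1000\n"
  mapM_ putStrLn (matchingLines "alice" output)
  print (valueFor "user" output)  -- Just "alice"
  print (valueFor "gid" output)   -- Nothing
```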
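
For the splitting case, a short sketch with splitOn from Data.List.Split, assuming the split package is installed. The colon-separated record and the name fields are illustrative only.

```haskell
import Data.List.Split (splitOn)

-- Split a colon-separated record into its fields.
-- Empty fields are kept as empty strings.
fields :: String -> [String]
fields = splitOn ":"

main :: IO ()
main =
  -- Made-up record in the style of /etc/passwd.
  print (fields "alice:x:1000:1000::/home/alice")
  -- ["alice","x","1000","1000","","/home/alice"]
```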

Fundamentally, the notion is that you want to do incremental processing of streams of characters.

Not everything is as efficient as possible, especially since the string representation is not that efficient. But if/when you move on to other data types, the core concepts of how you process things will translate directly from basic strings.

sclv

Using `isPrefixOf` on `tails` of a string isn't _better_ though, it's worse. It's more of a PITA to write, and it's slower than a good string matcher. – rampion Dec 10 '10 at 18:31
As I said, "not everything is as efficient as possible". `isInfixOf` is indeed what I was referring to. It's hardly "more of a PITA" to write, however. And as I said, the core concept translates straightforwardly. – sclv Dec 10 '10 at 19:50
@rampion: I should also add that if you're at the point where you're concerned with a "good string matcher", you shouldn't be using `[Char]` at all -- Data.Text has a good matcher out of the box, and there's an excellent substring search package for bytestrings as well. – sclv Dec 13 '10 at 15:49
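
As a sketch of the Data.Text route mentioned in the last comment, assuming the text package is installed. The helpers contains and after are hypothetical names; T.isInfixOf and T.breakOn are the library's own substring functions. Note that T.breakOn raises an error on an empty needle, so after should not be called with one.

```haskell
import qualified Data.Text as T

-- Does the needle occur anywhere in the haystack?
contains :: T.Text -> T.Text -> Bool
contains needle haystack = needle `T.isInfixOf` haystack

-- Text following the first occurrence of a non-empty needle, if any.
after :: T.Text -> T.Text -> Maybe T.Text
after needle haystack =
  case T.breakOn needle haystack of
    (_, rest)
      | T.null rest -> Nothing  -- needle not found
      | otherwise   -> Just (T.drop (T.length needle) rest)

main :: IO ()
main = do
  let line = T.pack "key=value"
  print (contains (T.pack "val") line)  -- True
  print (after (T.pack "=") line)       -- Just "value"
  print (after (T.pack ":") line)       -- Nothing
```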
