I am about to start developing an application in Haskell that requires some Unicode support.

How do I perform Unicode pattern matching in Haskell? I saw GHC's syntax extension, but is there any language-level support for this (without needing GHC's special extension)?

I saw this question, but the answer given there uses an extension-based approach. Also, what is the best Haskell library for working with Unicode: ByteString or Text? What are the advantages and disadvantages of each?

Tem Pora
  • I'm confused: GHC's extension lets *code* be Unicode, and Text is for *data* which is Unicode... is your code or your data Unicode? – daniel gratzer Jun 05 '13 at 17:56
  • Haskell source is Unicode. That extension just lets you use some Unicode symbols in place of built-in operators. – Philip JF Jun 05 '13 at 18:01

1 Answer

As far as I can tell, pattern matching on Unicode characters works out of the box. Try this:

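-- Strip a leading '薬'; Char patterns work for any Unicode code point.
f :: String -> String
f ('薬':rest) = rest
f _           = "Your string doesn't begin with 薬"

main :: IO ()
main = do
  putStrLn (f "薬は絶対飲まへん!")
  putStrLn (f "なぜ?死にたいのか?")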

Regarding libraries, you definitely want Text rather than ByteString. Text is actually meant for working with text: it counts the length of a string in characters rather than in bytes, and so on. ByteString is just an immutable array of bytes with a few extra frills, better suited to storing and transmitting binary data.
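To make the character-versus-byte difference concrete, here is a minimal sketch, assuming the `text` and `bytestring` packages are available. The helper names `charCount` and `utf8ByteCount` are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Number of Unicode code points (Haskell 'Char's) in the text.
-- Hypothetical helper, not part of any library.
charCount :: T.Text -> Int
charCount = T.length

-- | Number of bytes the same text occupies once encoded as UTF-8.
-- Hypothetical helper, not part of any library.
utf8ByteCount :: T.Text -> Int
utf8ByteCount = B.length . TE.encodeUtf8
```

For `T.pack "薬"`, `charCount` gives 1 while `utf8ByteCount` gives 3, because that character takes three bytes in UTF-8. Note that `T.length` counts code points, so a character built from several combining code points still counts as more than one.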

As for pattern matching on ByteString, Text, etc., deconstructing them with ordinary patterns is simply not possible without extensions, since they are opaque types with deliberately hidden implementations. You can, however, pattern match on characters inside the many higher-order functions that operate on Text/ByteString:

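import qualified Data.Text as T
import System.IO (hFlush, stdout)

-- Fold step: bump the running count for each 't' or 'T'.
countTs :: Int -> Char -> Int
countTs n 't' = n+1
countTs n 'T' = n+1
countTs n _   = n

main :: IO ()
main = do
  putStr "Please enter some text> "
  hFlush stdout  -- show the prompt before blocking on input
  str <- T.pack `fmap` getLine
  let ts = T.foldl countTs 0 str
  putStrLn ("Your text contains " ++ show ts ++ " letters t!")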
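Folds are not the only route. As a rough sketch, `T.uncons` from `Data.Text` splits off the first character and returns `Maybe (Char, Text)`, so an ordinary `case` can match on a Unicode character without any extension. The function name `stripMedicine` is invented for this example.

```haskell
import qualified Data.Text as T

-- | Drop a leading '薬' from a Text value, mirroring the String example.
-- Hypothetical helper: T.uncons exposes the head and tail, and the
-- pattern match happens on the resulting Maybe (Char, Text).
stripMedicine :: T.Text -> T.Text
stripMedicine t = case T.uncons t of
  Just ('薬', rest) -> rest
  _                 -> T.pack "Your string doesn't begin with 薬"
```

Calling `stripMedicine (T.pack "薬は絶対飲まへん!")` returns the text after the first character; any other input, including the empty text, falls through to the message.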

I wouldn't worry about using extensions if I were you, though. GHC is the de facto standard Haskell compiler, so it's highly unlikely that you'll ever need to compile your code with anything else.
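As one illustration of what an extension buys you, here is a small sketch using `OverloadedStrings`, which lets string literals appear as patterns for Text values. This matches whole literals by equality; it does not take a Text apart character by character. The function `describe` and its translations are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- | Map a few Text literals to an English gloss (hypothetical example).
-- With OverloadedStrings, each literal pattern is compared using (==).
describe :: T.Text -> T.Text
describe "薬" = "medicine"
describe "水" = "water"
describe _    = "unknown"
```

Here `describe (T.pack "薬")` yields the Text `"medicine"`, and anything not listed yields `"unknown"`.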

valderman
  • Clarification: You can pattern match on Text (and ByteString) literals using the `OverloadedStrings` extension. What you cannot do is use pattern matching to deconstruct a Text character by character like you can with String. – hammar Jun 05 '13 at 18:22
  • When I tried `putStrLn ("薬")`, I got the error `*** Exception: : hPutChar: invalid argument (invalid character)`. What is the reason for this? It might be a Windows-specific issue. I will try it on Linux later. – Tem Pora Jun 05 '13 at 18:34
  • That sounds like a Windows issue to me. At least, it works just fine on Linux. – valderman Jun 05 '13 at 18:58
  • See [this question](http://stackoverflow.com/questions/7511393/haskell-output-non-ascii-characters); apparently Windows GHCi does not handle Unicode gracefully (`putStrLn "薬"` works fine for me on OS X). – isturdy Jun 05 '13 at 19:02
  • Nice example for using 関西弁. – Overmind Jiang Jun 06 '13 at 04:01