37

Using Parsec 3.1, it is possible to parse several types of inputs:

  • [Char] with Text.Parsec.String
  • Data.ByteString with Text.Parsec.ByteString
  • Data.ByteString.Lazy with Text.Parsec.ByteString.Lazy

I don't see anything for the Data.Text module. I want to parse Unicode content without suffering from the String inefficiencies. So I've created the following module based on the Text.Parsec.ByteString module:

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}

module Text.Parsec.Text
    ( Parser, GenParser
    ) where

import Text.Parsec.Prim

import qualified Data.Text as T

instance (Monad m) => Stream T.Text m Char where
    uncons = return . T.uncons

type Parser = Parsec T.Text ()
type GenParser t st = Parsec T.Text st

  1. Does it make sense to do so?
  2. Is this compatible with the rest of the Parsec API?

Additional comments:

I had to add the {-# LANGUAGE NoMonomorphismRestriction #-} pragma to my parser modules to make it work.

Parsing Text is one thing; building an AST with Text is another. I will also need to pack each String result before returning it:

module TestText where

-- Qualified import, so Data.Text names cannot clash with the Prelude
import qualified Data.Text as T

import Text.Parsec
import Text.Parsec.Prim
import Text.Parsec.Text

input = T.pack "xxxxxxxxxxxxxxyyyyxxxxxxxxxp"

parser = do
  x1 <- many1 (char 'x')
  y <- many1 (char 'y')
  x2 <- many1 (char 'x')
  return (T.pack x1, T.pack y, T.pack x2)

test = runParser parser () "test" input
recursion.ninja
gawi

3 Answers

22

Since Parsec 3.1.2, support for Data.Text is built in. See http://hackage.haskell.org/package/parsec-3.1.2

If you are stuck with an older version, the code snippets in the other answers are helpful, too.
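As a minimal sketch of what this looks like in practice, assuming parsec 3.1.2 or later is installed: the built-in `Text.Parsec.Text` module supplies the `Parser` type, so no orphan instance is needed. The names `runs` and `parseRuns` are made up for this illustration. Note that `many1 (char 'x')` still yields a `String`, so the results are packed by hand.

```haskell
module Main where

import qualified Data.Text as T
import Text.Parsec (ParseError, char, many1, parse)
import Text.Parsec.Text (Parser) -- built in since parsec 3.1.2

-- Hypothetical parser: a run of 'x', a run of 'y', then a run of 'x'.
-- Each run is parsed as a String and packed into Text.
runs :: Parser (T.Text, T.Text, T.Text)
runs = do
  x1 <- T.pack <$> many1 (char 'x')
  y  <- T.pack <$> many1 (char 'y')
  x2 <- T.pack <$> many1 (char 'x')
  return (x1, y, x2)

-- Run the parser on Text input; "example" is only a label for error messages.
parseRuns :: T.Text -> Either ParseError (T.Text, T.Text, T.Text)
parseRuns = parse runs "example"

main :: IO ()
main = print (parseRuns (T.pack "xxyyyx"))
```

Giving `runs` an explicit type signature also means the `NoMonomorphismRestriction` pragma is not needed here.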

Zouppen
  • As far as I can tell, this only supports `Text` **input** (the first half of the question). `many` with `char`, `noneOf` and friends still only return `String` so you still need to `pack` yourself if you want `Text` **output** (second half of question), as described [here](https://stackoverflow.com/questions/26142294/using-data-text-with-parsec#comment40978337_26142294). Please correct if I'm wrong! – Heath Raftery Jan 08 '23 at 05:04
11

That looks like exactly what you need to do.

It should be compatible with the rest of Parsec, including the Text.Parsec.Char parsers.

If you're using Cabal to build your program, please put an upper bound of parsec-3.1 in your package description, in case the maintainer decides to include that instance in a future version of Parsec.

scravy
Antoine Latter
  • It's working OK except for the `Text.Parsec.Language` and `Text.Parsec.Token` modules which are restricted to `String`. I can work around that problem by performing my own tokenization. `Text.Parsec.Language` is just a gadget anyway (Mondrian? anyone?). – gawi Nov 02 '10 at 01:17
  • Ah! I wonder if we can generalize those to any Char stream in a backwards compatible way. It doesn't look hard, but since I never use those modules I don't have any good test-cases. – Antoine Latter Nov 04 '10 at 16:04
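The "perform my own tokenization" workaround mentioned in the comments could look roughly like the sketch below. It assumes the built-in `Text.Parsec.Text` module (parsec 3.1.2 or later); `lexeme`, `symbol`, `integer`, `intList` and `parseIntList` are hypothetical helpers written for this example, not part of `Text.Parsec.Token`.

```haskell
module Main where

import qualified Data.Text as T
import Text.Parsec (ParseError, digit, eof, many1, parse, sepBy, spaces, string)
import Text.Parsec.Text (Parser)

-- Run a parser, then skip any trailing whitespace.
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- Match a fixed piece of syntax as a token.
symbol :: String -> Parser String
symbol = lexeme . string

-- A non-negative integer token.
integer :: Parser Integer
integer = lexeme (read <$> many1 digit)

-- A bracketed, comma-separated list of integers, e.g. "[1, 2, 3]".
intList :: Parser [Integer]
intList = spaces *> symbol "[" *> (integer `sepBy` symbol ",") <* symbol "]" <* eof

parseIntList :: T.Text -> Either ParseError [Integer]
parseIntList = parse intList "example"

main :: IO ()
main = print (parseIntList (T.pack "[1, 2 ,3]"))
```

Because the helpers are built only from the `Text.Parsec.Char` parsers, they work on any `Char` stream that has a `Stream` instance, including `Text`.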
5

I added a function parseFromUtf8File to help reading UTF-8 encoded files in an efficient fashion. Works flawlessly with umlaut characters. Function type matches parseFromFile from Text.Parsec.ByteString. This version uses strict ByteStrings.

-- A derivative work from
-- http://stackoverflow.com/questions/4064532/using-parsec-with-data-text

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
{-# OPTIONS_GHC -fno-warn-orphans #-}

module Text.Parsec.Text
    ( Parser, GenParser, parseFromUtf8File
    ) where

import Text.Parsec.Prim
import qualified Data.Text as T
import qualified Data.ByteString as B
import Data.Text.Encoding
import Text.Parsec.Error

instance (Monad m) => Stream T.Text m Char where
    uncons = return . T.uncons

type Parser = Parsec T.Text ()
type GenParser t st = Parsec T.Text st

-- | @parseFromUtf8File p filePath@ reads @filePath@ as a strict
-- 'B.ByteString' using 'B.readFile', decodes it as UTF-8, and runs
-- the 'T.Text' parser @p@ on the result. Returns either a
-- 'ParseError' ('Left') or a value of type @a@ ('Right').
--
-- >  main    = do{ result <- parseFromUtf8File numbers "digits.txt"
-- >              ; case result of
-- >                  Left err  -> print err
-- >                  Right xs  -> print (sum xs)
-- >              }
parseFromUtf8File :: Parser a -> String -> IO (Either ParseError a)
parseFromUtf8File p fname = do
  raw <- B.readFile fname
  let input = decodeUtf8 raw
  return (runP p () fname input)
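One caveat with the module above: `decodeUtf8` throws an exception when the bytes are not valid UTF-8. A hedged variant, sketched below, uses `decodeUtf8'` from `Data.Text.Encoding` so that a decoding failure is returned as a `Left` instead. It assumes the built-in `Text.Parsec.Text` from parsec 3.1.2 or later; `parseUtf8Bytes` and `word` are hypothetical names, and collapsing both error kinds into a `String` is a simplification chosen for this sketch.

```haskell
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import Text.Parsec (SourceName, letter, many1, parse)
import Text.Parsec.Text (Parser)

-- Decode strict bytes as UTF-8 and run a Text parser on the result.
-- Both decoding errors and parse errors are reported as a String in Left.
parseUtf8Bytes :: Parser a -> SourceName -> B.ByteString -> Either String a
parseUtf8Bytes p name raw =
  case decodeUtf8' raw of
    Left decodeErr -> Left (show decodeErr)
    Right txt      -> either (Left . show) Right (parse p name txt)

-- Hypothetical parser: one or more Unicode letters, packed into Text.
word :: Parser T.Text
word = T.pack <$> many1 letter

main :: IO ()
main = do
  -- Valid UTF-8 containing umlaut characters
  print (parseUtf8Bytes word "inline" (encodeUtf8 (T.pack "Grüße")))
  -- Invalid UTF-8: reported as Left rather than thrown
  print (parseUtf8Bytes word "inline" (B.pack [0xff, 0xfe]))
```

To parse a file, the bytes would come from `B.readFile`, as in `parseFromUtf8File` above.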
scravy
Zouppen