Parsec and sequence of commaSep input

Question

I took the example below partially from SO and changed it to my needs. It almost fits, but what I want to do is that always the first string in the commaSep expr is parsed as identifier whilst all subsequent strings should be strings only.

Currently they are all parsed as Identifiers.

*Parser> parse expr "" "rd (isFib, test2, 100.1, ?BOOL)"
Right (FuncCall "rd" [Identifier "isFib",Identifier "test2",Number 100.1,Query "?BOOL"])

I have tried a number of solutions that in the end all would break down to parsing the whole input without using commaSep. Means I would have to ignore the structure and do something like

expr_parse = do
    name <- resvd_cmd
    char '('
    skipMany space
    worker <- ident
    char ','
    skipMany1 space
    args <- commaSep expr --not fully worked this out yet
    query <- theQuery
    skipMany space
    char ')'
    return (name, worker, args, query)

that looks less optimal and very clunky to me. Is there any way to refactor expr in the code below, achive what I need and keep it simple?

module Parser where

import Control.Monad (liftM)
import Text.Parsec
import Text.Parsec.String (Parser)
import Lexer
import AST

expr = ident <|>  astring <|> number <|> theQuery <|> callOrIdent

astring = liftM String stringLiteral <?> "String"

number = liftM Number float <?> "Number"

ident = liftM Identifier identifier <?> "WorkerName"

questionm :: Parser Char
questionm = oneOf "?"

theQuery :: Parser AST
theQuery = do first <- questionm
              rest <- many1 letter
              let query = first:rest
              return ( Query query )

resvd_cmd = do { reserved "rd"; return ("rd") }
            <|> do { reserved "eval"; return ("eval") }
            <|> do { reserved "read"; return ("read") }
            <|> do { reserved "in"; return ("in") }
            <|> do { reserved "out"; return ("out") }
            <?> "LINDA-like Tuple"

callOrIdent = do
    name <- resvd_cmd
    liftM (FuncCall name)(parens $ commaSep expr) <|> return (Identifier name)

AST.hs

{-# LANGUAGE DeriveDataTypeable #-}

module AST where

import Data.Typeable

data AST
    = Number Double
    | Identifier String
    | String String
    | FuncCall String [AST]
    | Query String
    deriving (Show, Eq, Typeable)

Lexer.hs

module Lexer (
            identifier, reserved, operator, reservedOp, charLiteral, stringLiteral,
            natural, integer, float, naturalOrFloat, decimal, hexadecimal, octal,
            symbol, lexeme, whiteSpace, parens, braces, angles, brackets, semi,
            comma, colon, dot, semiSep, semiSep1, commaSep, commaSep1
    )where

import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (haskellStyle)

lexer = P.makeTokenParser ( haskellStyle
                            {P.reservedNames = ["rd", "in", "out", "eval", "take"]}
                         )


identifier = P.identifier lexer
reserved = P.reserved lexer
operator = P.operator lexer
reservedOp = P.reservedOp lexer
charLiteral = P.charLiteral lexer
stringLiteral = P.stringLiteral lexer
natural = P.natural lexer
integer = P.integer lexer
float = P.float lexer
naturalOrFloat = P.naturalOrFloat lexer
decimal = P.decimal lexer
hexadecimal = P.hexadecimal lexer
octal = P.octal lexer
symbol = P.symbol lexer
lexeme = P.lexeme lexer
whiteSpace = P.whiteSpace lexer
parens = P.parens lexer
braces = P.braces lexer
angles = P.angles lexer
brackets = P.brackets lexer
semi = P.semi lexer
comma = P.comma lexer
colon = P.colon lexer
dot = P.dot lexer
semiSep = P.semiSep lexer
semiSep1 = P.semiSep1 lexer
commaSep = P.commaSep lexer
commaSep1 = P.commaSep1 lexer

Why do you want "the first string in the commaSep expr parsed as identifier whilst all subsequent strings should be strings only"? You're parsing your Query as an `expr` with `commaSep expr` and then trying to parse it again with `char ','>> theQuery`. If you want to parse the Query completely separately rather than as an `expr`, it's tricky, because `commaSep` will eat the `,` then fail on the next input. If that's really what you need (and I don't think it is because you already have Query in your AST anyway) then you'd need a less optimistic `commaSep` variant. — AndrewC, Nov 25 '12 at 22:18
@AndrewC `isFib` in this case is the name of a worker. The rest are arguments. In order to be able to (later on) distinguish what was the name of the worker and what was a string argument, the one should be parsed is Identifier and all the args either as String or as Number. — J Fritsch, Nov 25 '12 at 22:25
So you'd like to return the name of the worker, some string arguments and a separate query? — AndrewC, Nov 25 '12 at 22:27
@AndrewC Yes, that pretty much works, except that all Strings are seen as worker names. — J Fritsch, Nov 25 '12 at 22:30
...which you don't want. Do you want the query as the last element on the end of a list or separately? — AndrewC, Nov 25 '12 at 22:32
should it be an error if there is more than one query/no query/query not at the end? — AndrewC, Nov 25 '12 at 22:49
@AndrewC Yes. The query is at the end but maybe more than one. — J Fritsch, Nov 25 '12 at 23:01

AndrewC · Accepted Answer · 2012-11-26T12:25:29.797

First, I'd like to introduce you to the function lexeme which alters a parser to eat trailing whitespace. You're encouraged to use it rather than explicitly eating the whitespace. The difficulty is with commaSep because it eats the , and then fails. It would be nice to write a less optimistic commaSep, but let's solve your problem directly.

Let's apply lexeme to comma

acomma = lexeme comma

One of the problems with your code was you were expecting it to see test2 as String "test2" but the astring parser expects its strings to begin and end with ". Let's make a parser for bald strings, but make sure they don't start with ? and don't contain spaces or commas:

baldString = lexeme $ do
   x <- noneOf "? ,)"
   xs <- many (noneOf " ,)")   -- problematic - see comment below
   return . String $ x:xs

The breakthrough came when I realised that because there has to be a query at the end, there was always a comma after a baldString:

baldStringComma = do 
        s <- baldString
        acomma
        return s

Now let's make a parser for one or more queries at the end of the tuple:

queries = commaSep1 (lexeme theQuery)

And now we can take the identifier, the baldStrings and the queries

therest = do
   name <- lexeme ident 
   acomma
   args <- many baldStringComma
   qs <- queries
   return (name,args,qs)

finally giving

tuple = do
    name <- lexeme resvd_cmd
    stuff <- parens therest
    return (name,stuff)

So you get

*Parser> parseTest tuple "rd (isFib, test2, 100.1, ?BOOL)"
("rd",(Identifier "isFib",[String "test2",String "100.1"],[Query "?BOOL"]))

But if you want to lump the strings with the queries, you can return (name,args++qs) at the end of therest.

Applicative is Less Ugly

I found it frustrating to be tied to the Monad interface, when there are lovely things like <$>, <*> etc, so first

import Control.Applicative hiding (many, (<|>))

Then

baldString = lexeme . fmap String $
   (:) <$> noneOf "? ,)"   
       <*> many (noneOf " ,)")   -- problematic - see comment below

Here <$> is an infix version of fmap, so (:) will be applied to the output of noneOf "? ,", giving a parser that returns something like ('c':). This can then be applied to the output of many (noneOf " ,") using <*> to give the string we want.

baldStringComma = baldString <* acomma

This one's nice because we got the <*> operator to ignore the output of acomma and just return the output of baldString, using <*. If we wanted it the other way round, we could do *>, but you may as well use >> for that, which already ignores the output of the first parser.

therest = (,,) <$> 
   lexeme ident <* acomma
   <*> many baldStringComma
   <*> queries

and

tuple = (,) <$> lexeme resvd_cmd 
            <*> parens therest

But wouldn't it be nicer if we did

data Tuple = Tuple {cmd :: String, 
                    id :: AST,
                    argumentList :: [AST],
                    queryList :: [AST]} deriving Show

so we could do

niceTuple = Tuple <$> lexeme resvd_cmd <* lexeme (char '(')
                  <*> lexeme ident <* acomma
                  <*> many baldStringComma
                  <*> queries <* lexeme (char ')')

which gives (with a little manual pretty-printing to get it into the width)

*Parser> parseTest niceTuple "rd (isFib, test2, 100.1, ?BOOL)"
Tuple {cmd = "rd", 
       id = Identifier "isFib", 
       argumentList = [String "test2",String "100.1"], 
       queryList = [Query "?BOOL"]}

I also think your current AST is more of an abstract syntax store than an abstract syntax tree, and that you might get more milage from designing your own Tuple type and use that. Use

newtype Command = Cmd String  deriving Show

and suchlike to ensure type safety, then roll them together into your Tuple type with a parser to generate them.

I feel really uneasy about `baldString`, partly because we should probably do `astring <|> baldString`, but mainly because it matches a whole load of stuff that's not something instead of a whole load of stuff which _is_ something. I'd sleep easier at night if I knew exactly the form that was permitted for a string and parsed that. As it is right now (but I'm about to edit it) baldString will quite happily eat `)`, so if you forget the query, baldString will eat the close bracket and anything it likes until it meets `,` or ` `. `noneOf xs` is a rather dangerous parser, and needs replacing. — AndrewC, Nov 26 '12 at 12:24
Understand what you are saying. I have no clear picture about this yet in how far it maybe helpful to e.g. accept a bytes-string here vs to accept only valid Haskell variable identifiers. — J Fritsch, Nov 26 '12 at 13:47
@JFritsch Yes, I agree - Perhaps it's sensible to restrict for now to Haskell identifiers and possibly numeric literals as in your example. (...and you can quite easily parse Integers and Doubles separately if you want to.) This is a good way round because the error "it should parse this but it doesn't" is so much easier to track down and fix than "it's complaining about something I think should be there, expecting something strange" at a point further on from where the problem actually lies (an over-liberal baldString parser earlier on in the input). — AndrewC, Nov 26 '12 at 21:28
I don't understand the `(:)` , is that a delimiter sort of or a function? — J Fritsch, Nov 27 '12 at 21:07
@JFritsch `(:)` is the prefix version of the `:` infix function. `'H':"ello"` is `"Hello"`, so `(:) 'H' "ello"` is also `"Hello."`. Similarly `(+) 3 4` is `7`. — AndrewC, Nov 27 '12 at 21:24
@JFritsch When you partially apply `(:)` you get what's called an operator section. For example `(:) 'H'` is written `('H':)` for short. It's a function that puts `'H'` at the front of whatever it's given. `('H':) "elecopter"` is `"Helecopter"`. — AndrewC, Nov 27 '12 at 21:29
@JFritsch That's excellently clean and clear Applicative code there that's well laid out. I'd go for _less_ `noneOf`, not more, though. `oneOf` is so much safer, so I think it would be better to define something like `stringChar = letter <|> number <|> oneOf "_'.:$&!-+~#"`, using that instead of ever using noneOf at all. — AndrewC, Nov 28 '12 at 12:29

Parsec and sequence of commaSep input

1 Answers1

Applicative is Less Ugly