I am trying to parse (using megaparsec) the XML export of FreePlane (a mind mapper).
This is my third attempt to really 'learn' (internalize) megaparsec. I've written several parsers before: two worked (after a lot of struggling), and one I gave up on and parsed manually. I must be missing some fundamental concepts ... Help in that direction would be appreciated.
Executive Summary:
I need a Parser that does this ..
- Find the first thing before you find the second thing.
- If there is no first thing before you get to the second thing generate an error
- Don't go past the second thing
I'm probably not explaining it well ..
Here is the current specific problem ..
An XML section is given below (the dots stand for arbitrary characters). It shows two 'chunks' of attributes, each chunk surrounded by "< .. >"
... <... LINK="http://www.a.com" ... >...<... LINK="http://b.abc.com > ...
The problem is that the 'LINK' attribute is optional.
The 'chunks' have tons of attributes, so I don't know how to parse them individually .. Instead I skip over them, caring only about the 'LINK' attribute.
So, if I do the naive thing on the above section ..
_ <- manyTill asciiChar (string "LINK=\"")
text <- manyTill asciiChar (char '"')
return text
it works fine.
The link for the first chunk is returned.
If, however, the 'LINK' attribute is missing from the first chunk (shown below)
...<..... >...<... LINK="http://b.abc.com > ...
my naive parser will return the link from the second chunk,
which is incorrect.
I want it to error (so later I can use <|> )
I've spent 3 days on this problem trying different approaches .. including just trying to find a parser to do what I wanted ..
i.e. find the first thing (LINK) before you find the second thing ('>') .. If not error.
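One way to get that behaviour, sketched here under the assumption of megaparsec 7 or later (for `anySingleBut`), is to scan with a character parser that refuses to match '>'. The scan then cannot run past the end of the chunk, and wrapping the whole thing in `try` lets a failure backtrack so that `<|>` can try something else. The name `linkBeforeClose` and the sample inputs (modelled on the two examples above, with the closing quote added) are made up for illustration.

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- | Hypothetical helper: the value of LINK if it occurs before the next '>',
-- otherwise a parse error.
-- `anySingleBut '>'` is what stops the scan at the end of the chunk.
-- `try` rewinds the input on failure so an alternative can run after `<|>`.
linkBeforeClose :: Parser String
linkBeforeClose = try $ do
  _ <- manyTill (anySingleBut '>') (string "LINK=\"")
  -- Assumption: the value is properly quoted and contains no '>'.
  takeWhileP (Just "link character") (\c -> c /= '"' && c /= '>') <* char '"'

main :: IO ()
main = do
  -- First chunk has a LINK: the parser succeeds with that link.
  parseTest linkBeforeClose "...<a LINK=\"http://www.a.com\" >...<b LINK=\"http://b.abc.com\" > ..."
  -- First chunk has no LINK: the parser reports an error at the first '>'.
  parseTest linkBeforeClose "...<a ID=\"1\" >...<b LINK=\"http://b.abc.com\" > ..."
```

Because of the `try`, something like `linkBeforeClose <|> skipChunk` would fall through to the second branch with the input untouched.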
I finally gave up and tried the old fashioned way .. i.e. extract the text between the "< >", then use ordinary list functions to extract the link name .. and fail if it can't find it. This works but it's quite ugly!
Here's the code ..
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE NoMonomorphismRestriction #-}
import Data.List (find, isInfixOf, isPrefixOf)
import Data.Maybe
import qualified Data.Set as Set
import Data.Void
import Text.Megaparsec
import Text.Megaparsec.Char
type Parser = Parsec Void String
--------------------------------
-- Skip a Chunk with no LINK --
--------------------------------
skipChunk :: Parser String
skipChunk = do
  _ <- manyTill asciiChar (char '<')
  _ <- manyTill asciiChar (char '>')
  return ""
------------------------------------
-- Extract LINK old fashioned way --
------------------------------------
linkName :: String -> String
linkName text =
  let ws = words text
   in fromJust $ find (\w -> isPrefixOf "LINK=" w) ws
----------------
-- Parse Link --
----------------
parseLink :: Parser String
parseLink = do
  _ <- lookAhead (manyTill asciiChar (char '<')) -- make sure to not consume input
  text <- lookAhead (manyTill asciiChar (char '>')) -- capture chunk
  name <-
    if (isInfixOf "LINK=\"" text) -- extract link old fashioned way
      then return $ linkName text
      else (failure Nothing Set.empty)
  return name
-----------------
-- Base Parser --
-----------------
baseParser :: Parser String
baseParser = do
  _ <- parseLink <|> skipChunk -- First parser fails, skips the chunk
  parseLink -- Yields LINK from second chunk
----------
-- Main --
----------
main :: IO ()
main =
  -- parseTest prints the result (or the parse error) itself and returns ()
  parseTest baseParser "...<.. INK=\"abc\" ... > ... <... LINK=\"http://b.abc.com > ..."
This actually works and yields
"LINK="http://b.abc.com"
But wow .. is this ugly !!!
So,
Is there a Parser to "Find the first thing before you find the second thing .. If not error"?
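When the second thing is itself a parser rather than a single character, the same requirement can be expressed with `notFollowedBy`. The following is only a sketch of a hypothetical helper (`findBefore` is not part of megaparsec), assuming megaparsec 7 or later for `anySingle`:

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- | Hypothetical combinator: skip forward one token at a time until
-- `target` succeeds, but fail as soon as `stop` would match first.
-- The outer `try` makes a failure consume no input, so `<|>` still works.
findBefore :: Parser a -> Parser b -> Parser a
findBefore target stop =
  try (skipManyTill (notFollowedBy stop *> anySingle) (try target))

main :: IO ()
main = do
  let linkStart = string "LINK=\""
      chunkEnd  = char '>'
  -- LINK comes before '>': succeeds, positioned just after the opening quote.
  parseTest (findBefore linkStart chunkEnd) "<a LINK=\"x\" >"
  -- '>' comes before any LINK: fails without consuming input.
  parseTest (findBefore linkStart chunkEnd) "<a ID=\"1\" ><b LINK=\"x\" >"
```

`target` is tried before `stop` at each position, so if both could match at the same place, `target` wins.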
It seems like megaparsec is good if you are going character by character and your grammar is well defined and well known. Skipping arbitrary sections seems problematic .. Is this true?
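Skipping seems workable as long as every skip is bounded by a delimiter. As a rough sketch, again assuming megaparsec 7 or later and properly quoted attribute values, `takeWhileP` can jump over the uninteresting text and `optional` can turn a missing LINK into `Nothing` instead of an error. The names `chunkLink` and `allLinks` and the sample input are invented here:

```haskell
import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char, string)

type Parser = Parsec Void String

-- | Hypothetical: one "<...>" chunk, yielding its LINK value if it has one.
chunkLink :: Parser (Maybe String)
chunkLink = do
  _    <- takeWhileP Nothing (/= '<') *> char '<'  -- skip text before the chunk
  link <- optional (try linkAttr)                  -- Nothing if there is no LINK
  _    <- takeWhileP Nothing (/= '>') *> char '>'  -- skip the rest of the chunk
  pure link
  where
    linkAttr = do
      -- Never scan past '>' while looking for the attribute.
      _ <- manyTill (anySingleBut '>') (string "LINK=\"")
      takeWhileP (Just "link character") (\c -> c /= '"' && c /= '>') <* char '"'

-- | LINK values of all chunks in order, then ignore any trailing text.
allLinks :: Parser [Maybe String]
allLinks = many (try chunkLink) <* takeRest

main :: IO ()
main =
  parseTest allLinks "...<a ID=\"1\" >...<b LINK=\"http://b.abc.com\" > ..."
```

Each chunk is consumed exactly once, so there is no need for `lookAhead` or for re-scanning the chunk text with list functions.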
I love Haskell but if anyone can point me to a good book on Visual Basic I'd appreciate it, as that seems to be my intellectual limit.
Thanks in advance,
Tom