I am interested in parsing an HTML file with Haskell to extract the strings inside tags. For example, I want to get the string between <body>...</body>. I tried:

  getValue :: String -> [String]
  getValue [] = []
  getValue '<':x:'>':y:'<':'/':x:'>':z = y:[]:getValue z

If it works at all, it will enumerate every match. But I am only interested in the largest matches, i.e. those that are not contained in any other output element. How do I do that?

1 Answer

The code you wrote matches only a 1-character tag name enclosing a 1-character body.

<p>x</p>          Matches
<ul>y</ul>        Does not match
<p>xyz</p>        Does not match
<body>x</body>    Does not match

I'm guessing that isn't what you want at all.
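
To make the table above concrete, here is a sketch of the same one-character pattern in a form that compiles. The original needs parentheses around the argument pattern, cannot bind `x` twice in one pattern, and `y:[]:...` does not type-check, so this version binds the closing tag name to a second variable `x'` (my own name), compares the two in a guard, and returns `[y]`. The final fall-through clause that skips one character is also my addition, so that the function is total.

```haskell
-- One-character tag name around a one-character body, as in the question.
getValue :: String -> [String]
getValue [] = []
getValue ('<':x:'>':y:'<':'/':x':'>':z)
  | x == x' = [y] : getValue z      -- opening and closing names agree
getValue (_:rest) = getValue rest   -- otherwise skip a character and retry
```

With this, `getValue "<p>x</p>"` gives `["x"]`, while the other three inputs in the table all give `[]`, because each `:` in the pattern consumes exactly one character.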

You can't use pattern-matching to match arbitrary regular expressions; you'll need to use a regex library for that. As I see it, your options are:

  • Use a real HTML or XML parsing library.
  • Use a parser construction library to build a custom parser.
  • Use a real regex library. (See this.)
  • Start writing a simple state-machine by hand.

Which option you choose depends on what you're trying to do. Do you actually want to "solve" the problem, or are you just trying to learn how to do stuff in Haskell?
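
If it is the learning route, the hand-written option might start out something like the sketch below. It uses only `base`, and the names `splitOn1` and `between` are made up for illustration. It assumes the tag is written without attributes (exactly `<body>`, not `<body class="x">`) and that tags of the same name are not nested; real HTML breaks both assumptions, which is why a proper parsing library is the safer choice for real input.

```haskell
import Data.List (stripPrefix)

-- Split at the first occurrence of a delimiter: (text before, text after).
-- Returns Nothing if the delimiter never occurs.
splitOn1 :: String -> String -> Maybe (String, String)
splitOn1 delim = go []
  where
    go acc rest
      | Just after <- stripPrefix delim rest = Just (reverse acc, after)
    go _ [] = Nothing
    go acc (c:cs) = go (c : acc) cs

-- Text between the first <tag> and the next </tag> after it.
between :: String -> String -> Maybe String
between tag html = do
  (_, afterOpen) <- splitOn1 ("<" ++ tag ++ ">") html
  (inner, _)     <- splitOn1 ("</" ++ tag ++ ">") afterOpen
  pure inner
```

For example, `between "body" "<html><body>Hello <b>world</b></body></html>"` yields `Just "Hello <b>world</b>"`, and `between "body" "<p>x</p>"` yields `Nothing`. Because it takes everything up to the closing tag, inner tags stay inside the result rather than being reported separately.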

MathematicalOrchid
  • Thank you for your answer. What do you mean by parser construction library? – abhinav mehta Nov 11 '15 at 12:28
  • There are libraries like `tagsoup` which already parse HTML for you. Or there are other libraries like `Parsec` that let you build new parsers for just about anything (e.g., if you designed your own custom language that you wanted to parse). Or you can write the entire thing yourself from scratch. It depends what you're trying to achieve. – MathematicalOrchid Nov 11 '15 at 12:36