Missing characters using Text.Regex.PCRE to parse web page title

Question

I recently made a website that needs to retrieve talk titles from the TED website.

So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now

From the web page source, I get:

<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>

Now, in ghci, I tried this:

λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]

Either way, the captured title is missing some characters on the left and includes some unintended characters on the right. It seems to have something to do with the -- in the talk title. However, matching the same pattern against a string literal works:

λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]

Luckily, Text.Regex.Posix does not have this problem:

λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
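
One hypothesis that would fit the output above, and it is only an assumption, not something established in this thread: the match offsets are computed over a UTF-8 byte encoding of the page and then applied as character indices. Every non-ASCII character ahead of the match would then push the window to the right, which could explain why the literal body' (pure ASCII) behaves correctly. The sketch below uses no regex library at all; it only reproduces the arithmetic of such a shift. All names in it (utf8Width, charOffset, byteOffset, slice, page, target) are hypothetical, and page is a made-up stand-in for the real response body.

```haskell
import Data.Char (ord)
import Data.List (findIndex, isPrefixOf, tails)

-- Number of bytes one character occupies when encoded as UTF-8.
utf8Width :: Char -> Int
utf8Width c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- Character index of the first occurrence of the needle, if any.
charOffset :: String -> String -> Maybe Int
charOffset needle haystack = findIndex (needle `isPrefixOf`) (tails haystack)

-- Byte offset (in UTF-8) that corresponds to a given character index.
byteOffset :: Int -> String -> Int
byteOffset i = sum . map utf8Width . take i

-- Substring of the given length starting at the given character index.
slice :: Int -> Int -> String -> String
slice off len = take len . drop off

-- Made-up page: three non-ASCII characters precede the title (assumption).
page :: String
page = "\233\233\233<title>Hi</title>\n<link>"

-- The text a correct match would return.
target :: String
target = "<title>Hi</title>"
```

Loaded into ghci, charOffset target page gives Just 3, and slice 3 (length target) page returns the full "<title>Hi</title>". Slicing at the byte offset instead, slice (byteOffset 3 page) (length target) page, returns "tle>Hi</title>\n<l": three characters lost on the left and three extra on the right, the same shape as the PCRE result for the title above.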

score 4 · Accepted Answer · answered Mar 27 '13 at 12:25

My recommendation would be: don't use a regex for parsing HTML. Use a proper HTML parser instead. Here's an example using the html-conduit parser together with the xml-conduit cursor library (and http-conduit for the download).

{-# LANGUAGE OverloadedStrings #-}
import           Data.Monoid          (mconcat)
import           Network.HTTP.Conduit (simpleHttp)
import           Text.HTML.DOM        (parseLBS)
import           Text.XML.Cursor      (attributeIs, content, element,
                                       fromDocument, ($//), (&//), (>=>))

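main :: IO ()
main = do
    -- Download the page as a lazy ByteString.
    lbs <- simpleHttp "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
    -- Parse the HTML and wrap the document in a cursor for navigation.
    let doc = parseLBS lbs
        cursor = fromDocument doc
    -- Text content of every <title> element.
    print $ mconcat $ cursor $// element "title" &// content
    -- Text content of the <span> whose id is "altHeadline".
    print $ mconcat $ cursor $// element "span" >=> attributeIs "id" "altHeadline" &// content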

The code is also available as active code on the School of Haskell.
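
To try the same cursor queries without a network round trip, here is a minimal sketch that parses a hard-coded document instead of downloading one. The sample is a reduced, hand-written stand-in built from the two lines of page source quoted in the question, not the real page, and the names sample, sampleCursor, titleText and headlineText are hypothetical helpers introduced for illustration. It assumes the html-conduit, xml-conduit, text and bytestring packages are installed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy.Char8 as L
import           Data.Text                  (Text)
import qualified Data.Text                  as T
import           Text.HTML.DOM              (parseLBS)
import           Text.XML.Cursor            (Cursor, attributeIs, content,
                                             element, fromDocument, ($//),
                                             (&//), (>=>))

-- Hand-written sample document (assumption: a reduced stand-in for the page).
sample :: L.ByteString
sample = "<html><head><title>Francis Collins: We need better drugs -- now | Video on TED.com</title></head><body><span id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span></body></html>"

-- Cursor positioned at the root of the parsed sample.
sampleCursor :: Cursor
sampleCursor = fromDocument (parseLBS sample)

-- Concatenated text content of every <title> element below the cursor.
titleText :: Cursor -> Text
titleText c = T.concat $ c $// element "title" &// content

-- Concatenated text content of the <span> whose id is "altHeadline".
headlineText :: Cursor -> Text
headlineText c =
    T.concat $ c $// element "span" >=> attributeIs "id" "altHeadline" &// content
```

In ghci, titleText sampleCursor and headlineText sampleCursor should give the page title and the headline respectively. Keeping the queries as functions of a Cursor means the same helpers can be applied to fromDocument (parseLBS lbs) for a downloaded page.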

Thank you for your suggestion. I know that in Haskell there are always multiple ways to solve the same problem. But as a beginner, I'm content with code that merely works. I will certainly take your advice when refactoring. - rnons, Mar 27 '13 at 13:10
This wasn't Haskell-specific advice. Using a regex for HTML/XML parsing is generally not a good idea. Have a look at: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags - Michael Snoyman, Mar 27 '13 at 14:09
