I recently made a website that needs to retrieve talk titles from TED website.
So far, the problem is specific to this talk: Francis Collins: We need better drugs -- now
From the web page source, I get:
<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>
<span id="altHeadline" >Francis Collins: We need better drugs -- now</span>
Now, in ghci, I tried this:
λ> :m +Network.HTTP Text.Regex.PCRE
λ> let uri = "http://www.ted.com/talks/francis_collins_we_need_better_drugs_now.html"
λ> body <- (simpleHTTP $ getRequest uri) >>= getResponseBody
λ> body =~ "<span id=\"altHeadline\" >(.+)</span>" :: [[String]]
[["id=\"altHeadline\" >Francis Collins: We need better drugs -- now</span>\n\t\t</h","s Collins: We need better drugs -- now</span"]]
λ> body =~ "<title>(.+)</title>" :: [[String]]
[["tle>Francis Collins: We need better drugs -- now | Video on TED.com</title>\n<l","ncis Collins: We need better drugs -- now | Video on TED.com</t"]]
Either way, the parsed title misses some characters on the left, and has some unintended characters on the right. It seems to have something to do with the --
in talk title. However,
λ> let body' = "<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>"
λ> body' =~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]
Luckily, this is not a problem with Text.Regex.Posix
.
λ> import qualified Text.Regex.Posix as P
λ> body P.=~ "<title>(.+)</title>" :: [[String]]
[["<title>Francis Collins: We need better drugs -- now | Video on TED.com</title>","Francis Collins: We need better drugs -- now | Video on TED.com"]]