
I'm trying to figure my way through HXT with XPath and arrows at the same time and I'm completely stuck on how to think through this problem. I've got the following HTML:

<div>
<div class="c1">a</div> 
<div class="c2">b</div> 
<div class="c3">123</div> 
<div class="c4">234</div> 
</div>

which I've extracted into an HXT XmlTree. What I'd like to do is define a function (I think?):

getValues :: [String] -> IOSArrow XmlTree [(String, String)]

Which, if used as getValues ["c1", "c2", "c3", "c4"], will get me:

[("c1", "a"), ("c2", "b"), ("c3", "123"), ("c4", "234")]

Help please?

Travis Brown
Muchin

3 Answers


Here's one approach (my types are a bit more general and I'm not using XPath):

{-# LANGUAGE Arrows #-}
module Main where

import qualified Data.Map as M
-- HXT 9 and later export the arrow API from Text.XML.HXT.Core;
-- older releases used Text.XML.HXT.Arrow instead.
import Text.XML.HXT.Core

-- Collect the class attribute and text of every inner div into a map.
classes :: (ArrowXml a) => a XmlTree (M.Map String String)
classes = listA (divs >>> divs >>> pairs) >>> arr M.fromList
  where
    -- Applied twice: once for the outer div, once for its div children.
    divs = getChildren >>> hasName "div"
    pairs = proc div -> do
      cls <- getAttrValue "class" -< div
      val <- deep getText         -< div
      returnA -< (cls, val)

-- Pair each requested class with its value, or Nothing if it is missing.
getValues :: (ArrowXml a) => [String] -> a XmlTree [(String, Maybe String)]
getValues cs = classes >>> arr (zip cs . lookupValues cs)
  where lookupValues cs m = map (flip M.lookup m) cs

main :: IO ()
main = do
  let xml = "<div><div class='c1'>a</div><div class='c2'>b</div>\
            \<div class='c3'>123</div><div class='c4'>234</div></div>"

  print =<< runX (readString [] xml >>> getValues ["c1", "c2", "c3", "c4"])

I would probably run an arrow to get the map and then do the lookups, but this way works as well.
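
As a rough sketch of that alternative, the lookups can live in plain functions over the map, outside the arrow. The names lookupAll and presentOnly are hypothetical, not part of HXT, and the second function is just one way to get back to the [(String, String)] shape the question asked for:

```haskell
import qualified Data.Map as M
import Data.Maybe (mapMaybe)

-- Hypothetical helper: pair every requested class with its lookup result.
lookupAll :: [String] -> M.Map String String -> [(String, Maybe String)]
lookupAll cs m = [(c, M.lookup c m) | c <- cs]

-- Hypothetical helper: keep only the classes that were actually found.
presentOnly :: [String] -> M.Map String String -> [(String, String)]
presentOnly cs m = mapMaybe (\c -> fmap ((,) c) (M.lookup c m)) cs
```

With the classes arrow above, runX (readString [] xml >>> classes) returns a list of maps (normally just one), and each map can then be passed to either helper.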


To answer your question about listA: divs >>> divs >>> pairs is a list arrow with type a XmlTree (String, String). In other words, it's a non-deterministic computation that takes an XML tree and returns zero or more string pairs.

arr M.fromList has type a [(String, String)] (M.Map String String). This means we can't just compose it with divs >>> divs >>> pairs, since the types don't match up.

listA solves this problem: it collapses divs >>> divs >>> pairs into a deterministic version with type a XmlTree [(String, String)], which is exactly what we need.
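
The effect of listA can be seen without any XML at all, using HXT's pure list arrow LA. This is only an illustrative sketch: emitPairs, toMap and sample are made-up names, and emitPairs merely stands in for divs >>> divs >>> pairs.

```haskell
import Control.Arrow (arr, (>>>))
import qualified Data.Map as M
import Text.XML.HXT.Core (LA, arrL, listA, runLA)

-- Hypothetical stand-in for `divs >>> divs >>> pairs`: a list arrow that
-- emits each pair of its input list as a separate result.
emitPairs :: LA [(String, String)] (String, String)
emitPairs = arrL id

-- listA gathers all results into a single list, which is the input type
-- that arr M.fromList expects.
toMap :: LA [(String, String)] (M.Map String String)
toMap = listA emitPairs >>> arr M.fromList

-- Made-up sample data for trying the arrows in GHCi.
sample :: [(String, String)]
sample = [("c1", "a"), ("c2", "b")]
```

In GHCi, runLA emitPairs sample yields one result per pair, while runLA toMap sample yields a single result holding the whole map.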

Travis Brown

Here is a way to do it using HandsomeSoup:

-- For the join function.
import Data.String.Utils
import Text.HandsomeSoup
import Text.XML.HXT.Core

-- Of each element, get class attribute and text.
getItem = (this ! "class" &&& (this /> getText))  
getItems selectors = css (join "," selectors) >>> getItem

main = do
  let selectors = [".c1", ".c2", ".c3", ".c4"]
  items <- runX (readDocument [] "data.html" >>> getItems selectors)
  print items

data.html is the HTML file.

Björn Lindqvist

Even though it's now 10 years after the original post, this is still very useful advice.

For anyone else stuck with Haskell XML processing in 2020, I can confirm that the first example works fine on the following system:

ghci --version
The Glorious Glasgow Haskell Compilation System, version 8.8.4

macOS Catalina (10.15.7)

Thanks very much for helping me out - it's saved me a lot of time.

  • Thank you for your comment, it is appreciated and adds value. It can become unclear whether aged answers are still relevant or not. As a small note, it might however have been better placed as a comment on the original answer which would make clear which answer you're referring to etc, rather than as an answer to the question. Thanks again however and welcome to StackOverflow. – W.Prins Dec 03 '20 at 21:49