
I am currently working on a program in Haskell where I analyze a website and try to find all links (href) that belong to this website. I was already able to extract all the links of the main page, but I am struggling with the recursion, since I want to follow the links I have already found and repeat the same process for them.

This is what I have already:

parseHtml = fmap LB.unpack . simpleHttp
filterFunc x y = -- damn long line with a lot of filters

main :: IO()
main = do
    let site = "https://stackoverflow.com/"
    url <- parseHtml site
    let links = filterFunc site url
    mapM_ print $ take 5 $ links

And this is my output so far:

"https://stackoverflow.com/company/about"
"https://stackoverflow.com/company/work-here"
"https://stackoverflow.com/help"
"https://stackoverflow.com/jobs/directory/developer-jobs"
"https://stackoverflow.com/questions/44691577/stream-versus-iterators-in-set"

I just need a hint on how to proceed further and how to visit the links I have already found. Should I work with a fold?

Sarah K.
2 Answers


Link finding is essentially a graph traversal problem, which can be tricky in Haskell because of functional purity: you cannot simply mark nodes (links) as visited by mutating an external history table.

Your typical traversal algorithm might look something like this:

function traverse(current_node) {
    if (current_node.is_visited) {
        return some_data;
    } else {
        current_node.is_visited = true;           // Hard in Haskell!
        accumulated_data = ...;
        for (child in current_node.children()) {
            accumulated_data += traverse(child);  // Recursion happens here.
        }
        return accumulated_data;
    }
}

Because there is not an easy, direct way to mark a node as visited or not, we can try other solutions. For instance, we might consider something of the sort:

traverse :: ([URL], Data) -> URL -> ([URL], Data)
traverse (history, datum) current = let ... in (new_history, accumulated_data)

The idea here is as follows: we keep an explicit list of URLs that we have visited. This allows us to return quickly from the current node (URL) if it already appears in our history list (perhaps a Set for optimization? :)). Each subsequent call to a child node using traverse would then get the new_history list, effectively keeping track of which URLs are visited and which are unvisited.

One possible way to implement this is using a fold function such as foldl:

foldl :: Foldable t => (b -> a -> b) -> b -> t a -> b

Here the type t a might be [URL], denoting the children of the current link, and our traverse function conveniently has the shape (b -> a -> b), where type b = ([URL], Data) and type a = URL.

Can you take it from here and figure out how to combine traverse and foldl?
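
In case it helps to see the shape of it, here is a minimal sketch (not a full crawler): childrenOf is a hypothetical pure stand-in for "fetch the page and extract its links" (in your real program that step comes from parseHtml and filterFunc and lives in IO), Data is taken to be just the list of URLs collected so far, and the strict foldl' is used in place of foldl. The function is called traverse' to avoid clashing with Prelude's traverse.

import qualified Data.Set as Set
import Data.List (foldl')

type URL  = String
type Data = [URL]   -- the accumulated data: every URL collected so far

traverse' :: (Set.Set URL, Data) -> URL -> (Set.Set URL, Data)
traverse' (history, collected) current
    | current `Set.member` history = (history, collected)             -- already visited: nothing to do
    | otherwise =
        let history' = Set.insert current history                     -- mark the current URL as visited
            children = childrenOf current                             -- the links found on this page
        in  foldl' traverse' (history', current : collected) children -- recursion happens in the fold

-- Hypothetical placeholder; replace with the real fetching/filtering step.
childrenOf :: URL -> [URL]
childrenOf _ = []

A crawl would then start from something like traverse' (Set.empty, []) "https://stackoverflow.com/".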

Anton Xue
  • You may want to use a `Set URL` instead of `[URL]`. Both for efficiency and because this is what you mean when you keep around that list. – gallais Jun 22 '17 at 11:54
  • I have this code snippet in my filterFunc, `Set.toList . Set.fromList`, to get rid of all the duplicates I get. Should I just keep it as a `Set` to make it easier to work with? – Sarah K. Jun 22 '17 at 12:08
  • 1
  • Keeping it as a `Set` is probably the easiest option, as @gallais has pointed out. Another thing to watch out for is that links with different names may point to the same page, so it is probably wise to also get a canonical representation for them. – Anton Xue Jun 22 '17 at 12:28
  • If for some reason you need to keep unique representations in `List` format, one option is to use `List.nub`: http://hackage.haskell.org/package/base-4.9.1.0/docs/Data-List.html#v:nub – Anton Xue Jun 22 '17 at 12:33

Simply move your link-visiting logic into a separate function that takes a link as a parameter, and then recurse on the links, as you intuited.

Depending on what you ultimately want to do with the links, you can, for instance, simply fold your function over them.

For example, slightly modifying your code:

parseHtml = fmap LB.unpack . simpleHttp
filterFunc x y = -- damn long line with a lot of filters

visitLink :: String -> IO ()
visitLink site = do
    url <- parseHtml site
    let links = filterFunc site url
    mapM_ print $ take 5 $ links -- or whatever you want to do on your links
    mapM_ visitLink links -- the recursive call


main :: IO()
main = visitLink "https://stackoverflow.com/"

If, rather than printing the links as you go, you would prefer to return them, tweak the return type of visitLink (for instance String -> IO [String]) and change the last line of visitLink accordingly (for instance fmap join $ mapM visitLink links).

As mentioned in the other answer, keep in mind that with code this simple you may visit the same link infinitely many times. Consider storing the links you have visited in a suitable data structure (such as a set) that you pass along to visitLink.
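
For concreteness, here is a minimal sketch combining both suggestions (returning the links instead of only printing them, and threading a set of visited links through the recursion). It assumes parseHtml and filterFunc behave as in your question; foldM comes from Control.Monad:

import           Control.Monad (foldM)
import qualified Data.Set      as Set

-- parseHtml and filterFunc as in the question (definitions elided here).

visitLink :: Set.Set String -> String -> IO (Set.Set String)
visitLink visited site
    | site `Set.member` visited = return visited        -- already crawled: stop here
    | otherwise = do
        body <- parseHtml site
        let links = filterFunc site body
        -- recurse over the found links, threading the updated history through
        foldM visitLink (Set.insert site visited) links

main :: IO ()
main = do
    allLinks <- visitLink Set.empty "https://stackoverflow.com/"
    mapM_ print (Set.toList allLinks)

The accumulated set doubles as both the visited-set and the final result, so Set.toList at the end gives every link found on the site, ready to be printed or written to a file.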

gchelfi
  • I only printed them to make sure everything is working and to see what the outcome is, so I know how to work with it. The last step would be to save all the links I get into a file. – Sarah K. Jun 22 '17 at 11:16
  • And I forgot to say that I'm using a Set in my filterFunc (Set.toList . Set.fromList) to get rid of all the duplicates – Sarah K. Jun 22 '17 at 11:19
  • @SarahK. Sure, what I meant is that the solution I propose works for everything that fits in an `IO ()` (including writing to a file), but you may have to adapt it for other usages. Regarding my Set comment, keep in mind that a later page can refer to an earlier one (for instance a page linking back to the site's home page), and you need to take this into account. – gchelfi Jun 22 '17 at 12:45