
What is the current state of libraries for scraping websites with Haskell?

I'm trying to make myself do more of my quick one-off tasks in Haskell, in order to increase my comfort level with the language.

In Python, I tend to use the excellent PyQuery library for this. Is there something similarly simple and easy in Haskell? I've looked into TagSoup, and while the parser itself seems nice, actually traversing pages doesn't seem as nice as it is in other languages.

Is there a better option out there?

ricree
  • What do you find is missing from TagSoup? – Antoine Latter Jan 29 '11 at 22:56
  • The functions for searching the parsed document seem more limited than the libraries in other languages. The general-purpose functions such as sections don't seem that bad, but it still takes several lines of code for some really common uses. For example, selecting an element by class requires at least a couple of lines of code to do what would be a single call in jQuery. That wouldn't be bad for a single project, but my typical use case is a small one-off project, so I either maintain some helpers or repeat myself a bunch. Am I missing something? – ricree Jan 31 '11 at 14:03
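
For concreteness, the kind of TagSoup code the comment above describes might look like this (a minimal sketch; the divsWithClass helper is hypothetical, not part of TagSoup):

    import Text.HTML.TagSoup

    -- Hypothetical helper: select each <div class="..."> section of a page.
    -- Each result is the tag stream from a matching opening tag onward.
    -- Note that (~==) matches the class attribute verbatim, so a
    -- multi-valued attribute like class="foo bar" will not match "foo",
    -- which is one reason this feels clunkier than jQuery's $(".foo").
    divsWithClass :: String -> String -> [[Tag String]]
    divsWithClass cls = sections (~== TagOpen "div" [("class", cls)]) . parseTags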

4 Answers

http://hackage.haskell.org/package/shpider

Shpider is a web automation library for Haskell. It allows you to quickly write crawlers, and for simple cases (like following links) even without reading the page source.

It has useful features such as turning relative links from a page into absolute links, options to authorize transactions only on a given domain, and the option to download only HTML documents.

It also provides a nice syntax for filling out forms.

An example:

    runShpider $ do
        download "http://apage.com"
        theForm : _ <- getFormsByAction "http://anotherpage.com"
        sendForm $ fillOutForm theForm $ pairs $ do
            "occupation" =: "unemployed Haskell programmer"
            "location" =: "mother's house"

(Edit in 2018: shpider is deprecated; these days https://hackage.haskell.org/package/scalpel might be a good replacement.)
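
To give a flavor of scalpel, a minimal sketch (assuming its Text.HTML.Scalpel API; the URL is a placeholder) that scrapes the text and href of every link on a page:

    {-# LANGUAGE OverloadedStrings #-}
    import Text.HTML.Scalpel

    -- Scrape (text, href) pairs for every <a> element on a page.
    -- scrapeURL fetches the page and returns Nothing if anything fails.
    main :: IO ()
    main = do
        links <- scrapeURL "http://apage.com" linkScraper
        print links
      where
        linkScraper :: Scraper String [(String, String)]
        linkScraper = chroots "a" $ do
            txt  <- text anySelector
            href <- attr "href" anySelector
            return (txt, href)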

sclv

From my searching on the Haskell mailing lists, it appears that TagSoup is the dominant choice for parsing pages. For example: http://www.haskell.org/pipermail/haskell-cafe/2008-August/045721.html

As for the other aspects of web scraping (such as crawling, spidering, and caching), I searched http://hackage.haskell.org/package/ for those keywords but didn't find anything promising. I even skimmed through packages mentioning "http", but nothing jumped out at me.

Note: I'm not a regular Haskeller, so I hope others can chime in if I missed something.

David J.
  • The Haskell XML Toolbox (HXT) might be worth looking at: http://en.wikibooks.org/wiki/Haskell/XML – David J. Jan 29 '11 at 17:26
  • I can vouch for TagSoup: I used it exclusively for a project that was entirely based around HTML scraping. As for HTTP client packages, I wrote [http-enumerator](http://hackage.haskell.org/package/http-enumerator) specifically because I did not see any good alternatives. – Michael Snoyman Jan 30 '11 at 09:19
  • There is scalpel which builds on TagSoup: https://github.com/fimad/scalpel – guido Dec 27 '16 at 08:47
  • There is also https://github.com/bgamari/html-parse which claims to be 8x faster than tagsoup – unhammer Jan 26 '19 at 12:58

Although I'm still a beginner in Haskell for now, I have the strong opinion that HTML parsing in 2012 should be done using CSS selectors, and it seems the libraries recommended so far don't follow that principle.

One possibility is HandsomeSoup, which is built on top of HXT:

http://egonschiele.github.com/HandsomeSoup/

http://codingtales.com/2012/04/25/scraping-html-with-handsomesoup-in-haskell

This page about HXT, on which HandsomeSoup relies, will also be helpful (you're going to need getText or deep getText):

http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html
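
For instance, a minimal HandsomeSoup sketch (essentially the example from its documentation; the URL is a placeholder) that prints every link's href:

    import Text.HandsomeSoup
    import Text.XML.HXT.Core

    -- CSS selectors on top of HXT: fetch a page, select every <a>
    -- element, and extract its href attribute.
    main :: IO ()
    main = do
        let doc = fromUrl "http://example.com"
        links <- runX $ doc >>> css "a" ! "href"
        mapM_ putStrLn links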

But another choice is dom-selector:

http://hackage.haskell.org/package/dom-selector

It is alpha right now, and its long-term maintenance could be a problem. The advantage of dom-selector is that I couldn't get Unicode characters to work with HandsomeSoup, whereas they worked out of the box with dom-selector.

This question is related to that: Is it possible to use Text or ByteString on HXT in Haskell?

dom-selector is based on html-conduit and xml-conduit, for which maintenance appears assured.

EDIT: note my newer answer about lens-based parsing. I left this answer as it's still good on its own, but I would now personally rather use the other approach.

Emmanuel Touzery

I already wrote another answer to this question, suggesting CSS-selector-based parsing; however, that answer is now a year and a half old, and nowadays I think lenses might be a better approach in Haskell. In effect you get something like type-safe, compiled selectors.

See this reddit discussion for a couple of options in that vein. In case the link disappears, I copy the direct links:

I haven't used any of those yet, but if I were writing new HTML-parsing code today, I would definitely go with a lens-based approach.
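
As a rough sketch of the lens style (based on taggy-lens, one of the options discussed in that thread; the combinator names follow its documentation and may have shifted between versions):

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Lens (only, toListOf)
    import qualified Data.Text as T
    import qualified Data.Text.Lazy as TL
    import Text.Taggy.Lens (allNamed, contents, html)

    -- Collect the text of every <a> element; the composed optic reads
    -- like a compiled, type-checked selector.
    linkTexts :: TL.Text -> [T.Text]
    linkTexts = toListOf (html . allNamed (only "a") . contents)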

Emmanuel Touzery