Common Lisp package for parsing invalid HTML?

Question

As a learning exercise, I'm writing a web scraper in Common Lisp. The (rough) plan is:

1. Use Quicklisp to manage dependencies
2. Use Drakma to load the pages
3. Parse the pages with xmls

I've just run into a sticking point: the website I'm scraping doesn't always produce valid XHTML. This means that step 3 (parse the pages with xmls) doesn't work. And I'm as loath to use regular expressions as this guy :-)

So, can anyone recommend a Common Lisp package for parsing invalid XHTML? I'm imagining something similar to the HTML Agility Pack for .NET ...

score 11 · Accepted Answer · answered Jan 05 '11 at 01:11


The "closure-html" project (available in Quicklisp) will recover from bogus HTML and produce something with which you can work. I use closure-html together with CXML to process arbitrary web pages, and it works nicely. http://common-lisp.net/project/closure/closure-html/
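As a rough illustration of the approach described above, the following sketch fetches a page with Drakma and hands it to closure-html. It assumes Quicklisp is already set up; the function name `fetch-lhtml` is hypothetical, and the LHTML builder is just one of the builders closure-html provides (a CXML DOM builder can be passed instead when working with CXML).

```lisp
;; Assumption: Quicklisp is installed and loaded in the running image.
(ql:quickload '(:drakma :closure-html))

(defun fetch-lhtml (url)
  "Hypothetical helper: download URL with Drakma and parse the body
with closure-html into LHTML, a tree of nested lists."
  ;; For text content types DRAKMA:HTTP-REQUEST returns the body as a string,
  ;; which CHTML:PARSE accepts directly.
  (chtml:parse (drakma:http-request url)
               (chtml:make-lhtml-builder)))

;; The parser also accepts a string of malformed markup directly:
(chtml:parse "<p>unclosed <b>bold<p>next paragraph"
             (chtml:make-lhtml-builder))
```

The result is ordinary list structure, so it can be walked with the usual list functions rather than a dedicated DOM API.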

Xach


closure-html doesn't seem to work under GNU CLISP - but (without wishing to ignite a holy war) it looks like a move to SBCL will be painless. I still feel like I'm groping around in the Lisp wilderness, but at least now I can hear voices. Hopefully they're not just in my head :-) – Duncan Bayne Jan 05 '11 at 07:10

Ehvince · Answer 2 · 2017-09-25T20:28:09.657

For future visitors: today we have Plump: https://shinmera.github.io/plump

Plump is a parser for HTML/XML like documents, focusing on being lenient towards invalid markup. It can handle things like invalid attributes, bad closing tag order, unencoded entities, inexistent tag types, self-closing tags and so on. It parses documents to a class representation and offers a small set of DOM functions to manipulate it. You are free to change it to parse to your own classes though.

And then we have other libraries to query the document, such as lquery (jQuery-like) or CLSS (simple CSS selectors), by the same author.
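A minimal sketch of how Plump and CLSS might be combined, assuming both are loaded through Quicklisp. The markup string and the variable name `*root*` are made up for illustration; the paragraph tag is deliberately left unclosed to show that the parser does not signal an error on it.

```lisp
;; Assumption: Quicklisp is installed and loaded in the running image.
(ql:quickload '(:plump :clss))

;; Illustrative, intentionally sloppy markup (the <p> is never closed).
(defparameter *root*
  (plump:parse
   "<div><p>Unclosed paragraph <a href=\"/a\">First</a> <a href=\"/b\">Second</a></div>"))

;; CLSS:SELECT returns a vector of nodes matching the CSS selector.
;; Collect each link as a (href . text) pair.
(loop for link across (clss:select "a" *root*)
      collect (cons (plump:attribute link "href")
                    (plump:text link)))
```

lquery offers a more jQuery-like syntax for the same kind of query; CLSS is used here only because it keeps the example to plain function calls.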

We also now have a little tutorial on the Common Lisp Cookbook: https://lispcookbook.github.io/cl-cookbook/web-scraping.html

See also the Common Lisp wiki (CLiki): http://www.cliki.net/Web

and now there's a little tutorial: https://lispcookbook.github.io/cl-cookbook/web-scraping.html — Ehvince, Jul 05 '17 at 22:48

score 1 · Answer 3 · answered Apr 13 '11 at 14:55


Duncan, so far I've been successful using Clozure Common Lisp under both Ubuntu Linux and Windows (7 & XP), so if you're looking for an implementation that will work anywhere you might try this one.

RazvanP
