Is there a library for extracting data from an HTML page?

Question

I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.

What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.

Edit: basically I want to get posts and images from 4chan. The webpage isn't valid HTML (and doesn't have a doctype) so the parser shouldn't be too strict.

Many exist. You could easily write a book on this subject. It is, after all, just XML. — Samuel Harmer, Jan 23 '12 at 13:13
@R.Martinho For the purposes of scraping, it's much of a muchness — Samuel Harmer, Jan 23 '12 at 13:17
@Styne Yeah, apart from the fact that unless it's XHTML it's not actually valid XML, so you can't use an XML parser on it... HTML is actually an application of SGML, and there are specific HTML DOM parsers. — Benj, Jan 23 '12 at 13:20
@Styne - So you're suggesting that the OP find himself a particularly sloppy XML parser? I'll be impressed if you can find one so sloppy that it'll parse a
without a closing tag... — Benj, Jan 23 '12 at 13:23
@Benj Read between the lines. I'm suggesting it's not a specific enough question to have a specific answer that doesn't fill a book. There are many options for parsers depending on what type of HTML he or she is looking to scrape from. And seeing as the only clues we have are *get some kind of information out of some kind of HTML in either C, C++ or Obj-C* that's a question which needs amending. — Samuel Harmer, Jan 23 '12 at 13:27
@Styne - A general-purpose HTML parser will work on any "type of HTML", and there are plenty of HTML parser libraries which can be used from all 3 languages the OP tagged. There's nothing wrong with asking for library recommendations, although this question is a dupe of the one mentioned by Kypros. — Benj, Jan 23 '12 at 13:35

score 2 · Answer 1 · edited May 23 '17 at 12:13

What you are looking for is an HTML DOM parser.

This link to a previous question should help you out. Also check out this question.
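
To make the idea of a forgiving parser concrete, here is a minimal sketch. It is written in Python rather than the C-family languages the question is tagged with, purely to keep it short; in C, libxml2's HTML parser plays a similar role. It relies only on the standard-library `html.parser`, which does not require a doctype or closing tags. The class name `PostExtractor`, the helper `extract`, and the assumption that each post body sits in a `blockquote` element are all hypothetical; check the real page markup before relying on any of it.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect image URLs and post texts from loosely formed HTML."""

    def __init__(self, post_tag="blockquote"):
        super().__init__(convert_charrefs=True)
        # Assumption: one post body per <blockquote>; adjust to the real markup.
        self.post_tag = post_tag
        self.images = []   # values of every <img src=...> seen
        self.posts = []    # text of every completed post element
        self._depth = 0    # nesting depth inside post elements
        self._chunks = []  # text pieces of the post being read

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "br" and self._depth > 0:
            # Keep line breaks inside a post as newlines.
            self._chunks.append("\n")
        elif tag == self.post_tag:
            if self._depth == 0:
                self._chunks = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self.post_tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def extract(html, post_tag="blockquote"):
    """Return (posts, images) found in the given HTML string."""
    parser = PostExtractor(post_tag)
    parser.feed(html)
    parser.close()
    return parser.posts, parser.images
```

Calling `extract(page_source)` returns two lists, the post texts and the image URLs, in document order. Unclosed `<p>` or `<br>` tags and unquoted attribute values do not stop the parse, which is the property the question asks for.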

answered Jan 23 '12 at 13:05 by Kypros; edited by Community

score 0 · Answer 2 · answered Jan 23 '12 at 13:35

That's right, there are lots of libraries for parsing HTML. For example, if you use Perl, you can use HTML::Parse.

If you just want a quick result and don't mind calling a system command, you can use:

lynx -dump http://4chan.org

or

links -dump http://4chan.org
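
If you want to call one of these commands from a program instead of a shell, a rough sketch could look like the following. It assumes `lynx` or `links` is installed and on the PATH; the function names `build_dump_command` and `dump_page` and the 30-second timeout are made up for illustration.

```python
import shutil
import subprocess


def build_dump_command(url, browser="lynx"):
    """Return the argument list for `<browser> -dump <url>`."""
    if browser not in ("lynx", "links"):
        raise ValueError("expected 'lynx' or 'links'")
    return [browser, "-dump", url]


def dump_page(url, browser="lynx", timeout=30):
    """Run the text browser and return the rendered page as plain text."""
    # Assumption: the browser binary is installed; fail clearly if it is not.
    if shutil.which(browser) is None:
        raise FileNotFoundError(f"{browser} is not installed or not on PATH")
    result = subprocess.run(
        build_dump_command(url, browser),
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
        timeout=timeout,
        check=True,           # raise if the browser exits with an error
    )
    return result.stdout
```

Keep in mind that the dump is rendered text, so it suits a quick look at the posts more than structured extraction of posts and images.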

answered Jan 23 '12 at 13:35 by atom

Note that the question has `c++`, `objective-c` and `c` tags when it comes to languages ;-) – Michael Krelin - hacker Jan 23 '12 at 13:58
@MichaelKrelin-hacker just saw that. Sorry for being noob :) – atom Jan 23 '12 at 16:52
No prob. Not my prob, for sure ;-) – Michael Krelin - hacker Jan 23 '12 at 17:11