-1

I would like to extract information from a web page. Unfortunately, the website (4chan) doesn't have a public API, as far as I know.

What is a good library to extract specific data from an HTML document? I prefer a free software library that works on UNIX systems.


Edit: basically I want to get posts and images from 4chan. The webpage isn't valid HTML (and doesn't have a doctype) so the parser shouldn't be too strict.

  • Many exist. You could easily write a book on this subject. It is, after all, just XML. – Samuel Harmer Jan 23 '12 at 13:13
  • @Styne HTML is not XML. Only XHTML is valid XML. – R. Martinho Fernandes Jan 23 '12 at 13:15
  • @R.Martinho For the purposes of scraping, it's much of a muchness – Samuel Harmer Jan 23 '12 at 13:17
  • @Styne Yeh, apart from the fact that unless it's XHTML it's not actually valid XML so you can't use an XML parser on it... HTML is actually a markup of SGML and there are specific HTML DOM parsers. – Benj Jan 23 '12 at 13:20
  • @Benj That depends on how strict your parser is. – Samuel Harmer Jan 23 '12 at 13:21
  • @Styne - So you're suggesting that the OP find himself a particularly sloppy XML parser? I'll be impressed if you can find one so sloppy that it'll parse a [tag] without a closing tag... – Benj Jan 23 '12 at 13:23
  • @Benj Read between the lines. I'm suggesting it's not a specific enough question to have a specific answer that doesn't fill a book. There are many options for parsers depending on what type of HTML he or she is looking to scrape from. And seeing as the only clues we have are *get some kind of information out of some kind of HTML in either C, C++ or Obj-C* that's a question which needs amending. – Samuel Harmer Jan 23 '12 at 13:27
  • @Styne - A general purpose HTML parser will work on any "type of HTML" and there are plenty of HTML parser libraries which can be used from all 3 languages the OP tagged. There's nothing wrong with asking for library recommendations although this question is a dupe of the one mentioned by Kypros. – Benj Jan 23 '12 at 13:35

2 Answers

2

What you are looking for is an HTML DOM parser.

This link of a previous question should help you out. Also check out this question
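
To illustrate what a lenient HTML parser buys you, here is a minimal sketch. It is in Python rather than the C, C++ or Obj-C the question was tagged with, only because Python's standard library ships a forgiving parser (html.parser), so nothing needs installing. The PostScraper class, the scrape helper and the sample markup are all made up for this example, and treating every <blockquote> as a post is an assumption about the page layout, not something taken from 4chan itself.

```python
from html.parser import HTMLParser


class PostScraper(HTMLParser):
    """Collect <img> URLs and the text found inside <blockquote> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.images = []    # src attribute of every <img> seen
        self.posts = []     # text of every finished <blockquote>
        self._depth = 0     # greater than 0 while inside a <blockquote>
        self._chunks = []   # text pieces of the post being read

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)
        elif tag == "blockquote":
            self._depth += 1
        elif tag == "br" and self._depth:
            # Keep line breaks inside a post as newlines.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def scrape(html):
    """Return (image URLs, post texts) for one HTML document."""
    parser = PostScraper()
    parser.feed(html)
    parser.close()
    return parser.images, parser.posts


# Made-up, deliberately sloppy markup: no doctype, unclosed <p> and <br>,
# an unquoted attribute value, and no closing </html>.
SAMPLE = """
<html><body>
<p><img src=/thumbs/1.jpg>
<blockquote>First post<br>second line</blockquote>
<p><img src="/thumbs/2.png">
<blockquote>Another post</blockquote>
</body>
"""

images, posts = scrape(SAMPLE)
print(images)
print(posts)
```

The parser never complains about the missing doctype or the unclosed tags; it just reports the tags it sees, which is the behaviour the question asks for. A DOM-building library would give you a tree to query instead of callbacks, but the tolerance for invalid markup is the same idea.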

Kypros
0

That's right, there are lots of libraries for parsing HTML data. For example, if you use Perl, you can use HTML::Parse.

If you just want a quick result and you are willing to use a system command, you can use:

lynx -dump http://4chan.org

or

links -dump http://4chan.org
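
If you want to run one of those commands from a program and capture the text, something along these lines should work. This is only a sketch in Python: dump_command and dump_page are names I made up, it assumes lynx or links is installed and on your PATH, and the only flag it relies on is the -dump shown above.

```python
import subprocess


def dump_command(url, tool="lynx"):
    """Build the argument list for `lynx -dump URL` or `links -dump URL`."""
    if tool not in ("lynx", "links"):
        raise ValueError("tool must be 'lynx' or 'links'")
    return [tool, "-dump", url]


def dump_page(url, tool="lynx", timeout=30):
    """Run the browser in dump mode and return the rendered page as text.

    Raises subprocess.CalledProcessError if the command exits with an error
    and FileNotFoundError if the tool is not installed.
    """
    result = subprocess.run(
        dump_command(url, tool),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


# Show the command that would be run, without touching the network.
print(dump_command("http://4chan.org"))
```

Calling dump_page("http://4chan.org") would then give you the page as plain text. Passing the arguments as a list avoids shell quoting problems with unusual URLs. Keep in mind that the dump is rendered text, so it suits grabbing post text quickly but not walking the page structure.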
atom