69

How do you parse HTML with a variety of languages and parsing libraries?


When answering:

Individual answers here will be linked to from answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]

Library: [library name]

[example code]

Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

Peter O.
Chas. Owens
  • repeating the HTML builder code for each example is pointless – dfa Apr 22 '09 at 09:46
  • and why are you cluttering Perl code with pointless/useless use directives? (warnings and strict) – dfa Apr 22 '09 at 09:48
  • 4
    Self contained, working examples are better. All Perl code should include strict and warnings, they are not pointless; they are a part of Modern Perl. I shudder to think what your code looks like if you think they are "pointless" and "useless". – Chas. Owens Apr 22 '09 at 11:21
  • in my code I always use warnings and strict; in *THIS* context they are pointless. Most of these samples are not "self contained" (e.g. jquery, ruby and other answers) so why bother with perl-based solutions? – dfa Apr 22 '09 at 13:21
  • Because you can, and the JavaScript examples are self contained in their environment. I haven't changed the nokogiri example because I can't get nokogiri to install on my machine. I don't want to change code I don't understand. But I will change it; for one thing it doesn't look like it is solving the example. As for using strict, modeling unsafe code for people who are learning is a crime. They need all of the reinforcement they can get. – Chas. Owens Apr 22 '09 at 13:55
  • you are adding distracting things; use strict and resource handling is not the central point of the question – dfa Apr 22 '09 at 18:02
  • @Ira Baxter What part of "This question is a lazy way of collecting examples of parsing HTML with a variety of languages and parsing libraries." did you not understand? – Chas. Owens Mar 25 '11 at 13:00
  • C and C++ are tagged, but have no example here. :( – Jack Jun 13 '12 at 19:17
  • @Jack Yes, they are tagged so someone will **provide** an example. – Chas. Owens Jun 18 '12 at 12:34

29 Answers

29

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(using firebug console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

I used the other each form for this one; I think it's cleaner when chaining methods.

Peter Mortensen
Ward Werbrouck
  • Well yes, if you look at it that way. :) But using javascript/jquery for parsing HTML feels very natural, it's perfect for stuff like this. – Ward Werbrouck Apr 21 '09 at 22:32
  • Using the browser as the parser is the ultimate parser. The DOM in a given browser *is* the document tree. – Chas. Owens Apr 21 '09 at 22:37
25

Language: C#
Library: HtmlAgilityPack

using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            // print the href attribute itself rather than the link text
            Console.WriteLine(node.Attributes["href"].Value);
        }
    }
}
Chas. Owens
alexn
22

Language: Python
Library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links  

output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

also possible:

for link in links:
    print link['href']

output:

http://foo.com
http://bar.com
http://baz.com
Miles
Paolo Bergantino
  • This is nice, but does BeautifulSoup provide a way of looking into the tags to get the attributes? *goes off to look at docs* – Chas. Owens Apr 21 '09 at 17:16
  • 1
    The output in the first example is just the text representation of the matched links, they are actually objects to which you can do all kinds of fun stuff. – Paolo Bergantino Apr 21 '09 at 17:18
  • 1
    Yeah, I just read the docs, you just beat me to fixing the code. I did add the try/catch to prevent it from blowing up when href isn't there though. Apparently "'href' in link" doesn't work. – Chas. Owens Apr 21 '09 at 17:24
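
A follow-up to the comment thread above: instead of wrapping link['href'] in a try/except, the Tag object's get method returns None when an attribute is missing. This is only a minimal sketch in the same Python 2 / BeautifulSoup 3 style as the answer above:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<a name="top">no href</a><a href="http://foo.com">foo</a>')
for link in soup.findAll('a'):
    href = link.get('href')  # returns None instead of raising KeyError when href is absent
    if href is not None:
        print href
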
20

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);
AndyG
draegtun
  • 1
    That's brilliant. Never knew about pQuery, but it looks very cool. –  Apr 21 '09 at 21:43
  • Can you search for 'a[@href]' or 'a[href]' as in jQuery? It would simplify the code, and quite sure be faster. – Ward Werbrouck Apr 21 '09 at 22:51
  • 1
    Here are some other stackoverflow questions with pQuery answers... http://stackoverflow.com/questions/713827/how-can-i-screen-scrape-with-perl/713846#713846 http://stackoverflow.com/questions/574199/how-do-i-extract-an-html-title-with-perl http://stackoverflow.com/questions/254345/how-can-i-extract-urls-from-a-web-page-in-perl/254506#254506 http://stackoverflow.com/questions/221091/how-can-i-extract-xml-of-a-website-and-save-in-a-file-using-perls-lwp/223662#223662 – draegtun Apr 22 '09 at 10:10
  • @code-is-art: Unfortunately not yet... to quote the author from the docs: "The selector syntax is still very limited. (Single tags, IDs and classes only)". Check out the tests because pQuery does have features that aren't in the documentation, e.g. say 'Number of <td> with "blah" content - ', pQuery('td:contains(blah)')->size; – draegtun Apr 22 '09 at 10:19
15

Language: shell
Library: lynx (well, it's not a library, but in the shell every program is a kind of library)

lynx -dump -listonly http://news.google.com/
Chas. Owens
  • +1 for trying, +1 for a working solution, -1 for the solution not being generalizable to other tasks: net +1 – Chas. Owens Apr 21 '09 at 16:56
  • 7
    well, the task was quite well defined - it had to extract links from "a" tags. :) –  Apr 21 '09 at 17:10
  • Yes, but it is defined as an example to show how to parse, I could have just as easily asked you to print all of the contents of tags that had the class "phonenum". – Chas. Owens Apr 21 '09 at 17:27
  • 3
    I agree that this doesn't help with the generic question, but the specific question is likely to be a popular one, so it seems reasonable to me as a way to do it for a specific domain of the general problem. – Tanktalus Apr 21 '09 at 18:49
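
To make the point in the comments concrete: the "phonenum" task mentioned above is the kind of thing a general parser handles with one extra argument. A minimal sketch using Python and BeautifulSoup 3 (the class name phonenum is just the hypothetical example from the comment):

from BeautifulSoup import BeautifulSoup

html = '<div><span class="phonenum">555-0100</span><span class="name">Alice</span></div>'
soup = BeautifulSoup(html)
for tag in soup.findAll(attrs={'class': 'phonenum'}):
    print tag.string  # the text content of each matching tag
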
14

Language: Ruby
Library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }
Chas. Owens
Pesto
12

Language: Python
Library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
Chas. Owens
11

Language: Perl
Library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        }, 
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);
Chas. Owens
  • Using LWP::Simple to download this page (as I do below in my perl example) showed that you found a's that didn't have href's (but had names), so we just want to check that there *is* an href before printing it. – Tanktalus Apr 21 '09 at 16:40
9

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that there are modules for very specific tasks, like link extraction.

Whole program:

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

Explanation:

  • use strict - turns on "strict" mode - eases potential debugging, not fully relevant to the example
  • use HTML::LinkExtor - loads the interesting module
  • use LWP::Simple - just a simple way to get some html for tests
  • my $url = 'http://www.google.com/' - which page we will be extracting urls from
  • my $content = get( $url ) - fetches page html
  • my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback on every url, and $url to use as the BASEURL for relative urls
  • $p->parse( $content ) - pretty obvious I guess
  • exit - end of program
  • sub process_link - begin of function process_link
  • my ($tag, %attr) - get the arguments, which are the tag name and its attributes
  • return unless $tag eq 'a' - skip processing if the tag is not <a>
  • return unless defined $attr{'href'} - skip processing if the <a> tag doesn't have an href attribute
  • print "- $attr{'href'}\n"; - pretty obvious I guess :)
  • return; - finish the function

That's all.

  • Nice, but I think you are missing the point of the question, the example is there so that the code will be similar, not because I want the links. Think in more general terms. The goal is to provide people with the tools to use parsers instead of regexes. – Chas. Owens Apr 21 '09 at 17:12
  • 5
    It is possible that I'm missing something, but I read in the problem description: "For the sake of consistency, I ask that the example be parsing an HTML file for the href in anchor tags." If you'd asked for an example of parsing <table> tags, I would probably use HTML::TableExtract - basically, a specialized tool beats (in my opinion) a general tool. –  Apr 21 '09 at 17:38
  • Fine, find all span tags that have the class "to_understand_intent" that are inside of div tags whose class is "learn". Specialized tools are great, but they are just that: specialized. You will wind up needing to know the general tool one day. This is a question about the general tools, not specialized libraries that use those tools. – Chas. Owens Apr 21 '09 at 18:32
  • 4
    For this new request - of course HTML::Parser would be much better. But just saying "use HTML::Parser" is plain wrong. One should use the proper tool for a given task. For extracting hrefs I would say that using HTML::Parser is overkill. For extracting <table>s - as well. Asking "give me a general way to parse ..." is wrong because it assumes that there exists one tool (in a language) that's perfect for all cases. I personally parse HTML in at least 6 different ways, depending on what I need to do. –  Apr 21 '09 at 21:38
  • Look at the task again. The task was not to get the links in an HTML page; it was to demonstrate how your favorite parser works, using getting the links in an HTML page as an example. It was chosen because it is a simple task that involves finding the right tag and looking at a piece of data in it. It was also chosen because it is a common task. Because it is a common task Perl has automated it for you, but that doesn't mean this question was asking for the automated answer. – Chas. Owens Apr 21 '09 at 22:41
  • @Chas. Owens: the task was specific; given that a solution exists on CPAN, there should be an example using it (as well as a more general HTML::Parser example). And it isn't fully automated; you have to filter it to just anchor tag href attributes - how to do so is worth showing in an example. – ysth Apr 22 '09 at 00:49
  • Shorter: HTML::LinkExtor->new( sub{ print $_[2] if $_[0] eq "a" } )->parse_file("sample.html") – ysth Apr 22 '09 at 00:50
  • @ysth, the task was chosen at random; I could have chosen anything. As I have stated several times, the purpose of this question is to collect examples of full parsers, not to solve the example with specialized libraries that use parsers. This answer would be fine if the question was "How do I extract links from HTML with Perl?", but the question is "Can you provide an example of parsing HTML with your favorite parser?"; therefore the task is to demonstrate a parser, not extract links. – Chas. Owens Apr 22 '09 at 04:02
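
To make the query from the comments concrete ("span tags with the class to_understand_intent inside div tags whose class is learn"; both class names are just the hypothetical example from the comment), here is a minimal sketch showing that a general parser handles it as easily as link extraction. It uses Python and the same cssselect feature shown in the lxml answer further down in this thread:

import lxml.html

html = ('<html><body><div class="learn"><span class="to_understand_intent">keep this</span></div>'
        '<span class="to_understand_intent">skip this</span></body></html>')
tree = lxml.html.fromstring(html)
for span in tree.cssselect('div.learn span.to_understand_intent'):
    print span.text_content()
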
8

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"
Chas. Owens
Jules Glegg
8

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using DOM API, without using XPATH or STP API)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=> 
("http://foo.com/" "http://bar.com/" "http://baz.com/")
Chas. Owens
dmitry_vk
  • does collect or dom:get-attribute correctly handle tags that do not have href set? – Chas. Owens Apr 28 '09 at 14:56
  • 2
    Depending on definition of correctness. In example as it is shown, empty strings will be collected for "a" tags with no "href" attribute. If loop is rewritten as (loop for tag across (dom:get-elements-by-tag-name *dom* "a") when (string/= (dom:get-attribute tag "href") "") collect (dom:get-attribute tag "href")) then only non-empty "href"s will be collected. – dmitry_vk Apr 28 '09 at 15:19
  • Actually, that's not when (string/= (dom:get-attribute tag "href") "") but when (dom:has-attribute tag "href") – dmitry_vk Apr 28 '09 at 15:35
  • How would you do that without the loop macro? – davorb Feb 16 '13 at 23:58
6

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)


Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try it out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))
Michał Marczyk
5

Language: Perl
Library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

Caveat: you can get wide-character errors with pages like this one (changing the URL to the one commented out will trigger the error), but the HTML::Parser solution above doesn't share this problem.

Chas. Owens
Tanktalus
5

Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

runrig
5

Language: Java
Libraries: XOM, TagSoup

I've included intentionally malformed and inconsistent HTML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

TagSoup adds an XML namespace referencing XHTML to the document by default. I've chosen to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace like so:

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI())
laz
  • I'm sure either will work fine. TagSoup was made to parse whatever you can throw at it. – laz Apr 22 '09 at 00:23
4

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The same example, using packages from the new package system (html-parsing and sxml):

(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: Install the required packages with 'raco' from a command line, with:

raco pkg install html-parsing

and:

raco pkg install sxml
Kieron Hardy
Ryan Culpepper
4

Language: JavaScript
Library: DOM

var links = document.links;
for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (href != null) console.debug(href);
}

(using firebug console.debug for output...)

Peter Mortensen
Ward Werbrouck
4

Language: C#
Library: System.XML (standard .NET)

using System.Collections.Generic;
using System.Xml;

class Program
{
    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();

        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");

        FindHrefs(xd.FirstChild, matches);
    }

    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].InnerXml);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }
}
Chas. Owens
4

Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";
Ward Werbrouck
3

Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')
Van Gale
Adam
  • Hmm, I am getting "ImportError: No module named html" when I try to run this, is there something I need besides python-lxml? – Chas. Owens Apr 21 '09 at 19:01
  • Ah, I have version 1.3.6 and that comes with 2.0 and later – Chas. Owens Apr 21 '09 at 19:02
  • Indeed. I can provide an example of using lxml.etree to do the job as well if you like? lxml.html is a bit more tolerant of broken HTML. – Adam Apr 21 '09 at 19:26
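
Following up on the last comment: lxml.etree can do the same job when given its HTML parser; lxml.html is simply more tolerant of broken markup. A minimal sketch, assuming a reasonably recent lxml:

import lxml.etree

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

# parse with the etree HTML parser instead of lxml.html
tree = lxml.etree.fromstring(html, lxml.etree.HTMLParser())
for href in tree.xpath('//a/@href'):
    print href
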
3

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}
Peter Mortensen
dfa
  • It was also incorrect, you must call $document->eof; if you use $document->parse($html); and would print empty lines when href wasn't set. – Chas. Owens Apr 22 '09 at 11:34
  • reverted to my original code; ->eof() is useless in this sample; also checking for href presence is pointless in this example – dfa Apr 22 '09 at 13:17
  • Is there a reason you don't want to use new_from_content? – Chas. Owens Apr 22 '09 at 17:51
3

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else 
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}
Peter Mortensen
Alex Reynolds
1

Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress warnings from parsing invalid HTML.

Entea
  • Almost identical to my PHP version ( http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser/774853#774853 ) You don't need the getAttribute call – Ward Werbrouck Oct 29 '10 at 12:20
1

Language: Python
Library: HTQL

import htql

page = "<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>"
query = "<a>:href,tx"

for url, text in htql.HTQL(page, query):
    print url, text

Simple and intuitive.

seagulf
1

Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}
laz
1

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:

/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:iana@iana.org?subject=General%20website%20feedback

This is a minor spin on the one above, resulting in an output that is usable for a report. I only return the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com
the Tin Man
0

Language: Coldfusion 9.0.1+

Library: jSoup

<cfscript>
function parseURL(required string url){
    var res = [];
    var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
    var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
    //var dom = jSoupClass.parse(html); // if you already have some html to parse.
    var dom = jSoupClass.connect( arguments.url ).get();
    var links = dom.select("a");
    // ColdFusion arrays are 1-based, so use LTE to include the last link
    for(var a=1;a LTE arrayLen(links);a++){
        var s={};s.href= links[a].attr('href'); s.text= links[a].text();
        if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
    }
    return res;
}

//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structures; each struct contains HREF and TEXT keys.

chewymole
0

Language: JavaScript/Node.js

Library: Request and Cheerio

var request = require('request');
var cheerio = require('cheerio');

var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');

        anchorTags.each(function(i,element){
            console.log(element["attribs"]["href"]);
        });
    }
});

The Request library downloads the HTML document and Cheerio lets you use jQuery-style CSS selectors to target elements in the document.

0

Language: JavaScript
Library: PhantomJS (a headless browser rather than a library)

Using PhantomJS, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

run:

$ ../path/to/bin/phantomjs extract-links.js
jGc