0

We need a DOM parser that can run a set of patterns against a page and store the results. We are looking for open-source libraries to start from. The library should:

  • be able to select elements by regexp (for example, grab all elements that contain "price" in their class, id, or other attributes such as meta attributes),
  • have plenty of helpers, e.g. to remove comments, iframes, etc.,
  • be pretty fast,
  • be usable from browser extensions.
Wladimir Palant
Pentium10

2 Answers

3

OK, I'll say it:
you can use jQuery.

Pros:

  • it is a very good DOM parser
  • it is very good at manipulating the DOM (removing/adding/editing elements)
  • it has a great and intuitive API
  • it has a large, active community, so most jQuery-related questions already have answers
  • it works in browser extensions (I tested it myself in Chrome, and it apparently works in Firefox extensions too: see "How to use jQuery in Firefox Extension")
  • it is lightweight (about 31 KB minified and gzipped)
  • it is cross-browser
  • it is definitely open source

Cons:

  • it doesn't select by regex (which is actually a very good thing, as dda already mentioned), but a regex can be used to filter the matched elements
  • I don't know whether it can access or manipulate comments
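
On the last point: comment nodes are ordinary DOM nodes (nodeType 8), so they can at least be removed with plain DOM calls. Here is a minimal sketch that assumes only the standard childNodes / removeChild / nodeType interface. removeComments is a made-up helper name, not a jQuery function:

```javascript
// Hypothetical helper (not part of jQuery): strips comment nodes from a subtree.
// Relies only on the standard DOM Node interface: nodeType, childNodes, removeChild.
var ELEMENT_NODE = 1;
var COMMENT_NODE = 8;

function removeComments(root) {
  var removed = 0;
  // Copy childNodes first, because the live list changes while we remove nodes.
  var children = Array.prototype.slice.call(root.childNodes);
  children.forEach(function (child) {
    if (child.nodeType === COMMENT_NODE) {
      root.removeChild(child);
      removed += 1;
    } else if (child.nodeType === ELEMENT_NODE) {
      // Recurse into elements only; text nodes have no children.
      removed += removeComments(child);
    }
  });
  return removed; // number of comment nodes that were removed
}
```

In a page or a content script this would be called as removeComments(document.body). jQuery's .contents() also returns comment nodes, so filtering its result on nodeType === 8 should work as well, though I have not benchmarked either approach.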

Here's an example of some jQuery in action:

// select all the iframe elements with the class advertisement 
// that have the word "porn" in their src attribute
$('iframe.advertisement[src*=porn]')
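    // keep the ones that contain the word "poney" in their title
    // with the help of a regex
    .filter(function(){
        // fall back to the framed document's title (contentDocument is null when cross-origin)
        var title = this.title || (this.contentDocument && this.contentDocument.title) || '';
        return /poney/i.test(title);
    })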
        // and remove them
        .remove()
        // return to the whole match
        .end()
    // filter them again, this time 
    // affect only the big ones
    .filter(function(){
        return $(this).width() > 100 && $(this).height() > 100;
    })
        // replace them with some html markup
        .replaceWith('<img src="harmless_bunnies_and_kitties.jpg" />');
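
The regex filter is not limited to the title. To grab everything that mentions "price" in class, id, or any other attribute (as the question asks), one option is to test every attribute value of each element. This is a rough sketch: anyAttributeMatches is a made-up helper name, jQuery is assumed to be loaded in the page, and scanning '*' will be slow on large documents:

```javascript
// Hypothetical helper (not a jQuery API): true when any attribute value
// matches the given regular expression.
// `attributes` is array-like, e.g. element.attributes in a browser.
function anyAttributeMatches(attributes, pattern) {
  return Array.prototype.some.call(attributes, function (attr) {
    // Pass a pattern without the `g` flag: with `g`, test() keeps state in lastIndex.
    return pattern.test(attr.value);
  });
}

// Browser usage, assuming jQuery is available; skipped in other environments.
if (typeof jQuery !== 'undefined') {
  var priceElements = jQuery('*').filter(function () {
    return anyAttributeMatches(this.attributes, /price/i);
  });
  console.log(priceElements.length + ' elements mention "price" in an attribute');
}
```

Narrowing the initial selector (for example to a container element) before filtering keeps the number of regex tests down.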
Community
gion_13
0

node-htmlparser can parse HTML, provides a DOM with a number of utils (also supports filtering by functions) and can be run in any context (even in WebWorkers).

I forked it a while back, improved it for better speed and got some insane results (read: even faster than native libexpat bindings).

Nevertheless, I would advise you to use the original version, as it supports browsers out of the box (my fork can be run in browsers using browserify, which adds some overhead).

fb55