Do you know an open source JavaScript extraction/regexp engine?

Question

We need a DOM parser that can run a set of patterns against a page and store the results. We are looking for an open source library that we can build on. It should:

be able to select elements by regexp (for example, grab all elements that contain "price" in their class, id, or other attributes such as meta attributes);
have plenty of helpers, such as removing comments, iframes, etc.;
be reasonably fast;
be usable from browser extensions.
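
To make the first requirement concrete, here is a minimal sketch in plain JavaScript of what "select elements by regexp on any attribute" could look like. The function names `matchesAnyAttribute` and `selectByAttributePattern` are hypothetical, made up for illustration; only the standard DOM members `attributes` and `getElementsByTagName` are assumed to exist.

```javascript
// Returns true when any attribute value of `element` matches `pattern`.
// `element.attributes` is the standard array-like list of {name, value} pairs.
// Pass a pattern WITHOUT the `g` flag: with `g`, RegExp.prototype.test is stateful.
function matchesAnyAttribute(element, pattern) {
    var attrs = element.attributes || [];
    for (var i = 0; i < attrs.length; i++) {
        if (pattern.test(attrs[i].value)) {
            return true;
        }
    }
    return false;
}

// Collects every descendant element of `root` that has a matching attribute value.
function selectByAttributePattern(root, pattern) {
    var all = root.getElementsByTagName('*');
    var matches = [];
    for (var i = 0; i < all.length; i++) {
        if (matchesAnyAttribute(all[i], pattern)) {
            matches.push(all[i]);
        }
    }
    return matches;
}

// Possible usage in a page or an extension content script:
// var priced = selectByAttributePattern(document, /price/i);
```

This scans every element, so it is a baseline for comparing libraries, not an optimized selector.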

jQuery is great and does all of these things. — goat, May 30 '12 at 18:29
I've never tried, but it's just a JavaScript library. It should. — goat, May 30 '12 at 18:31

score 3 · Answer 1 · edited May 23 '17 at 12:04

OK, I'll say it:
You can use jQuery.

Upsides:

it is a very good DOM parser
it is very good at manipulating the DOM (removing/adding/editing elements)
it has a great and intuitive API
it has a big and active community, so there are lots of answers to any jQuery-related question
it works in browser extensions (I tested it myself in Chrome, and it apparently works in Firefox extensions too; see "How to use jQuery in Firefox Extension")
it is lightweight (about 31KB minified and gzipped)
it is cross-browser
it is definitely open source

Downsides:

it doesn't rely on regex for selection (which is actually a very good thing, as dda already mentioned), but a regex can still be used to filter the selected elements
I don't know whether it can access or manipulate comments

Here's an example of some jQuery action:

// select all the iframe elements with the class advertisement 
// that have the word "porn" in their src attribute
$('iframe.advertisement[src*=porn]')
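    // keep only the ones that contain the word "poney" in their title,
    // with the help of a regex
    .filter(function(){
        // title attribute, or the framed document's title when it is accessible
        var title = this.title || (this.contentDocument && this.contentDocument.title) || '';
        return /poney/i.test(title);
    }) 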
        // and remove them
        .remove()
        // return to the whole match
        .end()
    // filter them again, this time 
    // affect only the big ones
    .filter(function(){
        return $(this).width() > 100 && $(this).height() > 100;
    })
        // replace them with some html markup
        .replaceWith('<img src="harmless_bunnies_and_kitties.jpg" />');
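
On the open point about comments in the list of downsides: comment nodes are ordinary DOM nodes with `nodeType` 8, so they can be removed with a short recursive walk. The sketch below uses only `childNodes`, `nodeType` and `removeChild` from the standard DOM; the helper name `removeComments` is hypothetical.

```javascript
var COMMENT_NODE = 8; // same value as Node.COMMENT_NODE in the DOM spec

// Recursively removes every comment node below `node`.
// Returns the number of comment nodes that were removed.
function removeComments(node) {
    var removed = 0;
    // Copy the child list first: childNodes is live and shrinks as nodes are removed.
    var children = Array.prototype.slice.call(node.childNodes || []);
    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        if (child.nodeType === COMMENT_NODE) {
            node.removeChild(child);
            removed++;
        } else {
            removed += removeComments(child);
        }
    }
    return removed;
}

// Possible usage in a page or an extension content script:
// removeComments(document.body);
```

In jQuery terms, `.contents()` returns text and comment nodes as well as elements, so filtering that set on `this.nodeType === 8` and calling `.remove()` should do the same for the direct children of the matched elements.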

Does it work in browser extensions? — Pentium10, May 30 '12 at 18:28
Yes, it works quite nicely. See the update. — gion_13, May 30 '12 at 18:34

score 0 · Accepted Answer · answered May 30 '12 at 19:19

node-htmlparser can parse HTML, provides a DOM with a number of utils (also supports filtering by functions) and can be run in any context (even in WebWorkers).

I forked it a while back, improved it for better speed and got some insane results (read: even faster than native libexpat bindings).

Nevertheless, I would advise you to use the original version, as it supports browsers out of the box (my fork can be run in browsers using browserify, which adds some overhead).
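
To illustrate what "filtering by functions" over a parsed tree can look like, here is a rough sketch that does not call the library itself. It assumes the parser yields an array of plain-object nodes with `type` ('tag', 'text', 'comment'), `name`, `attribs` and `children` fields; that shape is an assumption to check against the library's documentation, and the helpers `findNodes`, `attributeMatches` and `stripNoise` are hypothetical names.

```javascript
// Depth-first search over a tree of plain-object nodes.
// `predicate` receives each node and returns true to keep it.
function findNodes(nodes, predicate, found) {
    found = found || [];
    for (var i = 0; i < nodes.length; i++) {
        var node = nodes[i];
        if (predicate(node)) found.push(node);
        if (node.children) findNodes(node.children, predicate, found);
    }
    return found;
}

// Predicate factory: tag nodes with any attribute value matching `pattern`.
// Use a pattern without the `g` flag, since `test` is stateful with it.
function attributeMatches(pattern) {
    return function (node) {
        if (node.type !== 'tag' || !node.attribs) return false;
        return Object.keys(node.attribs).some(function (name) {
            return pattern.test(node.attribs[name]);
        });
    };
}

// Returns a copy of the tree without comment nodes and <iframe> elements.
// The input tree is left untouched.
function stripNoise(nodes) {
    return nodes.filter(function (node) {
        var isIframe = node.type === 'tag' && node.name === 'iframe';
        return node.type !== 'comment' && !isIframe;
    }).map(function (node) {
        if (!node.children) return node;
        var copy = {};
        Object.keys(node).forEach(function (key) { copy[key] = node[key]; });
        copy.children = stripNoise(node.children);
        return copy;
    });
}

// Possible usage, given a parsed tree `dom`:
// var priced = findNodes(stripNoise(dom), attributeMatches(/price/i));
```

Because the tree is plain data, the same helpers would run unchanged in a page, an extension, or a worker.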
