
Some friends and I have been working on a set of scripts that make it easier to work on the machines at uni. One of these tools currently uses Nokogiri, but for the tools to run on all machines with as little setup as possible, we've been trying to find a 'native' HTML parser instead of requiring users to install RVM and custom gems (most users have limited disk space).

Are we pretty much restricted to Nokogiri/Hpricot? Should we look at just writing our own custom parser that fits our needs?

Cheers.

EDIT: If there are posts on here that I've missed in my searches, let me know! S.O. is sometimes just too large to find things effectively...

shearn89
  • Given that the gems are all open source, you can always extract what you need from a gem and use it in a custom parser, then you only have to deliver your own code... – Marc Talbot Feb 25 '12 at 15:59
  • I'd sure recommend against writing your own. – Dave Newton Feb 25 '12 at 16:03
  • It will be much more reliable to use existing solutions. And what @MarcTalbot said above is key: if a gem is open-source, you can just copy the source into your application (assuming that you do not require non-GPL libraries). – Linuxios Feb 25 '12 at 16:16
  • This may be a duplicate question: http://stackoverflow.com/questions/2554909/method-to-parse-html-document-in-ruby – Mr. Black Feb 25 '12 at 16:39
  • Yeah, our only problem is that the whole suite of tools is currently about 5MB, so to add all the libs for nokogiri (for example) bumps the package up to about 7MB. We were hoping there might be something small! No worries though, I'll take a look at using existing stuff packaged up. – shearn89 Feb 25 '12 at 18:17

1 Answer


There is no HTML parser in the Ruby stdlib.
HTML parsers have to be more forgiving of bad markup than XML parsers.

You could run the HTML through tidy (http://tidy.sourceforge.net)
to clean it up and produce well-formed markup.
The result can then be read with REXML :-) which is in the stdlib.
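As a rough sketch of that tidy-then-REXML pipeline (not tested against your pages): the helper names `tidy_to_xhtml` and `extract_links` are made up for illustration, and the code assumes a `tidy` binary is on the PATH. Note that on recent Ruby versions REXML ships as a bundled gem rather than as part of the stdlib proper, so check what your machines have.

```ruby
require 'open3'
require 'rexml/document'

# Hypothetical helper: pipe HTML through the external `tidy` binary
# (assumed to be installed and on PATH) and return well-formed XHTML.
# -asxhtml emits XHTML; -numeric emits numeric entities, which REXML
# can read without knowing the HTML named entities.
def tidy_to_xhtml(html)
  out, _err, status = Open3.capture3(
    'tidy', '-quiet', '-asxhtml', '-numeric', '-utf8', stdin_data: html
  )
  # tidy exits with 0 when clean, 1 on warnings, 2 on errors.
  code = status.exitstatus
  raise "tidy failed (exit #{code.inspect})" if code.nil? || code > 1
  out
end

# Hypothetical helper: collect the href of every <a> element.
# Walks the tree instead of using XPath, so the XHTML default
# namespace added by tidy does not get in the way.
def extract_links(xhtml)
  doc = REXML::Document.new(xhtml)
  links = []
  doc.root.each_recursive do |el|
    href = el.attributes['href']
    links << href if el.name == 'a' && href
  end
  links
end

# Example use (requires tidy to be installed):
#   links = extract_links(tidy_to_xhtml(File.read('page.html')))
```

This keeps the Ruby side free of gems, at the cost of depending on an external binary and on REXML's speed.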

REXML is much slower than Nokogiri (I last checked in 2009),
though Sam Ruby had been working on making REXML faster.

A better approach would be to improve your deployment.
Take a look at http://gembundler.com/bundle_package.html, and at using Capistrano (or something similar) to provision servers.

deepak
  • Thanks, the problem with deployment is that the tools get run on university-managed machines, so if we have to install anything it has to happen in the user's home directory, which is limited to a certain amount of space: few people have enough room to install something like RVM with custom gems. This is also pure Ruby, not Rails. – shearn89 Feb 26 '12 at 16:51
  • Another option might be to create and consume an API. The advantage is that the code is deployed on only one machine, which saves space. But benchmark the speed of an API call. – deepak Feb 28 '12 at 05:46
  • These aren't those sorts of tools: they're command-line utilities that do things like wrap up `lpr` into an easy-to-use tool. Thanks though. – shearn89 Feb 28 '12 at 13:06