Questions tagged [hpricot]

Hpricot is a Ruby library intended for parsing HTML. Until the release of Nokogiri, a competing HTML and CSS parser, Hpricot was the de facto HTML parser for the Ruby community.

163 questions
28
votes
4 answers

How do I do a regex search in Nokogiri for text that matches a certain beginning?

Given: require 'rubygems' require 'nokogiri' value = Nokogiri::HTML.parse(<<-HTML_END) " A Foo B C Bar…

bcolfer
  • 639
  • 1
  • 6
  • 15
24
votes
3 answers

Nokogiri vs Hpricot?

Which one would you choose? My important attributes are (not in order): Support and future enhancements. Community and general knowledge base (on the Internet). Comprehensive (i.e., proven to parse a wide range of *.*ml pages). Performance. Memory…
roshan
  • 1,323
  • 18
  • 31
15
votes
5 answers

Installing Hpricot on Ruby 1.9.1 on Windows

I am trying to install hpricot using the command: >gem install hpricot -v 0.8.2 Building native extensions. This could take a while... ERROR: Error installing hpricot: ERROR: Failed to build gem native extension. C:/Ruby19/bin/ruby.exe…
Marcus
  • 265
  • 2
  • 7
12
votes
1 answer

open-uri is not redirecting http to https

I am using Hpricot and OpenURI to parse webpages and extract URLs from them. When I get a link like "http:rapidshare.com", it is not redirecting to https. This is the error I…
leonidus
  • 363
  • 1
  • 3
  • 11
9
votes
4 answers

Strip text from HTML document using Ruby

There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri have inner_text methods that remove all HTML for you easily and quickly. What I am trying to do is the opposite, remove all the text from an HTML…
davidsmalley
  • 1,029
  • 3
  • 10
  • 15
7
votes
2 answers

Using Ruby with Mechanize to log into a website

I need to scrape data from a site, but it requires my login first. I've been using hpricot to successfully scrape other sites, but I'm new to using mechanize, and I'm truly baffled by how to work it. I see this example commonly quoted: require…
Spacew00t
  • 73
  • 1
  • 1
  • 3
7
votes
3 answers

Can any of Ruby's HTML Parsers do JavaScript to see the resulting DOM?

When trying Hpricot and Nokogiri, the HTML can be fetched and parsed, but can they also execute the JavaScript so that the content shows on the page (shows up in the DOM)? That's because some pages won't show the info unless the…
nonopolarity
  • 146,324
  • 131
  • 460
  • 740
6
votes
2 answers

What is a "terminated object", and why can't I call methods on it?

Periodically I get this exception: NotImplementedError: method `at' called on terminated object on this line of code: next if Hpricot(html).at('a') What does this error mean? How can I avoid it?
Tom Lehman
  • 85,973
  • 71
  • 200
  • 272
5
votes
1 answer

Convert HTML to plain text and maintain structure/formatting, with ruby

I'd like to convert HTML to plain text. I don't want to just strip the tags though, I'd like to intelligently retain as much formatting as possible. Inserting line breaks for <br> tags, detecting paragraphs and formatting them as such, etc. The…
John Bachir
  • 22,495
  • 29
  • 154
  • 227
5
votes
2 answers

Ruby Mechanize table scraping doesn't capture entire row

I am trying to scrape a table from a website with Mechanize. I want to scrape the second row. When I run: agent.page.search('table.ea').search('tr')[-2].search('td').map{ |n| n.text } I would expect it to scrape the whole row. But instead it only scrapes:…
Rails beginner
  • 14,321
  • 35
  • 137
  • 257
5
votes
3 answers

How does one remove tags from around text in XML using Hpricot?

I just want the text out of there without those tags. Does Hpricot.XML have any methods for this?
loosecannon
  • 7,683
  • 3
  • 32
  • 43
5
votes
1 answer

Where can I find Hpricot documentation?

Now that http://github.com/why/hpricot/wikis/home no longer exists.
Tom Lehman
  • 85,973
  • 71
  • 200
  • 272
4
votes
2 answers

Failing to extract html table rows

I am trying to extract all five rows listed in the table above. I'm using the Ruby Hpricot library to extract the table rows using an XPath expression. In my example, the XPath expression I use is /html/body/center/table/tr. Note that I've removed the tbody tag…
Terry Li
  • 16,870
  • 30
  • 89
  • 134
4
votes
5 answers

Looking for a recommendation of a good tutorial on best practices for a web scraping project?

I need to do a fairly extensive project involving web scraping and am considering using Hpricot or Beautiful Soup (i.e. Ruby or Python). Has anyone come across a tutorial that they thought was particularly good on this subject that would help me…
in bruges
4
votes
1 answer

CSS selector exclude elements, hpricot

I am trying to write a CSS selector that selects everything except the script elements with Hpricot. I can easily select all the contents of the select-me div and then remove the script elements, but I was wondering if it's possible to use a…
RailsSon
  • 19,897
  • 31
  • 82
  • 105