Ruby Hpricot RegEx replace <br>'s with <p>'s

Question

Can someone please tell me how to convert this JavaScript to Ruby using Hpricot and a regex?

// Replace all doubled-up <BR> tags with <P> tags, and remove fonts.
    var pattern =  new RegExp ("<br/?>[ \r\n\s]*<br/?>", "g");
    document.body.innerHTML = document.body.innerHTML.replace(pattern, "</p><p>").replace(/<\/?font[^>]*>/g, '');

The code I have set up so far is:

require 'rubygems'
require 'hpricot'
require 'open-uri'

@file = Hpricot(open("http://www.bubl3r.com/article.html"))

Thanks

You realize that this is going to cause a soupy mess of unmatched `<p>` tags, right? What if you hit code like `<div>foo<br><br>bar</div>`? I urge you to reconsider; `<p>` is semantic, not presentational, and there's nothing wrong with `<br>` anyway. At the very least, avoid using regex for parsing HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Justin Morgan - On strike, Feb 24 '11 at 23:20
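To make the concern about unmatched tags concrete, here is a minimal sketch that applies a Ruby port of the question's pattern to a small string. The sample markup and the constant name `DOUBLE_BR` are assumptions chosen for illustration; they do not come from the page in the question.

```ruby
# Ruby port of the question's JavaScript pattern (illustration only).
# DOUBLE_BR is a hypothetical name, not something from the original post.
DOUBLE_BR = /<br\s*\/?>\s*<br\s*\/?>/i

# Assumed sample: two <br> tags inside an element that is not a <p>.
sample = "<div>foo<br><br>bar</div>"

result = sample.gsub(DOUBLE_BR, "</p><p>")
puts result
# The </p> closes a paragraph that was never opened,
# and the new <p> is never closed before </div>.
```

Running this prints `<div>foo</p><p>bar</div>`, which is the kind of unbalanced markup the comment warns about.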

score 1 · Answer 1 · edited Feb 24 '11 at 23:12

The content at the OP's URL seems to have changed, as commonly happens on the internet, so I cobbled together some sample HTML to show how I'd go about this.

Also, Nokogiri is what I recommend as a Ruby HTML/XML parser because it's very actively supported, robust and flexible.

require 'nokogiri'

html = <<EOT
<html>
<body>
  some<br><br>text
  <font>
    text wrapped with font
  </font>
  some<br>more<br>text
</body>
</html>
EOT

doc = Nokogiri::HTML(html)

# Replace all doubled-up <BR> tags with <P> tags, and remove fonts.
doc.search('br').each do |n|
  # Guard against a <br> with no previous sibling (n.previous is nil).
  prev = n.previous
  if prev && prev.name == 'br'
    prev.remove
    n.replace('<p>')
  end
end

doc.search('font').each do |n|
  n.replace(n.content)
end

print doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >>   some<p></p>text
# >>   
# >>     text wrapped with font
# >>   
# >>   some<br>more<br>text
# >> </body></html>

Is the problem with highlighting a known issue? – Andrew Grimm, Feb 24 '11 at 23:13

score 0 · Answer 2 · answered Feb 15 '11 at 09:03

I think a better way to clean an HTML file is Beautiful Soup. I use it in Python, and it does a very good job because it emulates part of how a browser interprets HTML.

http://www.crummy.com/software/RubyfulSoup/

VGE


score 0 · Answer 3 · answered Feb 24 '11 at 06:51

Even though it won't produce valid HTML, something like this works:

require 'rubygems'
require 'hpricot'
require 'open-uri'

@file = Hpricot(open("http://www.bubl3r.com/article.html"))
puts @file.html.gsub('<br />', '<p>')

TuteC


There's really no point using a parser like that. Either use `gsub` against the retrieved HTML directly, or use the parser for what it's designed to do and walk the DOM replacing `<br>` with `<p>` tags. – the Tin Man, Feb 24 '11 at 08:13
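The first option in that comment, calling `gsub` on the raw HTML string without a parser, could look roughly like the sketch below. It ports both replacements from the question's JavaScript. The helper name `clean_html` and the sample string are assumptions made for illustration, and the earlier warning about unbalanced `<p>` tags still applies to the result.

```ruby
# Hypothetical helper (the name clean_html is not from the thread).
# Works on a plain String, so no HTML parser is involved.
def clean_html(html)
  html
    .gsub(/<br\s*\/?>\s*<br\s*\/?>/i, "</p><p>") # doubled-up <br> becomes a paragraph break
    .gsub(/<\/?font[^>]*>/i, "")                 # drop opening and closing <font> tags
end

# Assumed sample input; in practice this would be the fetched page body.
sample = "some<br>\n <br />text <font color=\"red\">in red</font> some<br>more"

cleaned = clean_html(sample)
puts cleaned
```

The pattern tolerates whitespace between the two tags and both `<br>` and `<br />` spellings, while a single `<br>` is left alone.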
