This is stumping me. I have a string which is a verbose piece of XHTML:

irb(main):012:0> input = <<-END
irb(main):013:0" <p><span class=\"caps\">ICES</span> evaluated the management plan in 2009
 and found it to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being based on lengths, excludes the problem connected with age estimation.</p>\n<p><span class=\"caps\">SSB</span> 
 index is estimated to have decreased by more than 20% between the periods 2010–2012 
 (average of the three years) and 2013–2014 (average of the two years).</p>\n<p>A candidate 
 multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre><code><p>
 The management plan, agreed October 2007 and implemented January 2008 was evaluated by 
 <span class=\"caps\">ICES</span> as to its accordance with the precautionary approach and 
 reviewed by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes 
 enter the fishery discarding is expected to further increase, justifying the implementation 
 of measures to improve gear selectivity, such as increases in mesh size 
 (<span class=\"caps\">ICES</span>, 2009a).</p></code></pre>
irb(main):014:0" END
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it to 
 be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being based 
 on lengths, excludes the problem connected with age estimation.</p>\n<p><span class=\"caps\">SSB
 </span> index is estimated to have decreased by more than 20% between the periods 2010–2012 
 (average of the three years) and 2013–2014 (average of the two years).</p>\n<p>A candidate 
 multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre><code><p>The 
 management plan, agreed October 2007 and implemented January 2008 was evaluated by <span 
 class=\"caps\">ICES</span> as to its accordance with the precautionary approach and reviewed 
 by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes enter the 
 fishery discarding is expected to further increase, justifying the implementation of 
 measures to improve gear selectivity, such as increases in mesh size (<span class=\"caps\">ICES
 </span>, 2009a).</p></code></pre>\n"

Now I want to strip out the text contained in the <pre><code> tags but it fails:

irb(main):015:0> input.gsub(/<pre>.*<\/pre>/,'')
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it
 to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being 
 based on lengths, excludes the problem connected with age estimation.</p>\n<p><span 
 class=\"caps\">SSB</span> index is estimated to have decreased by more than 20% between the 
 periods 2010–2012 (average of the three years) and 2013–2014 (average of the two years).</p>\n
 <p>A candidate multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p><pre>
 <code><p>The management plan, agreed October 2007 and implemented January 2008 was evaluated 
 by <span class=\"caps\">ICES</span> as to its accordance with the precautionary approach 
 and reviewed by three independent scientists.</p>\n<p>As the strong 2005 and 2006 year classes 
 enter the fishery discarding is expected to further increase, justifying the implementation 
 of measures to improve gear selectivity, such as increases in mesh size (<span class=\"caps\">ICES</span>, 2009a).</p></code></pre>\n"

If I strip out the newlines first, then it works:

irb(main):016:0> input.gsub(/\n/,'').gsub(/<pre>.*<\/pre>/,'')
=> "<p><span class=\"caps\">ICES</span> evaluated the management plan in 2009 and found it 
 to be in accordance with the PA. However, the <span class=\"caps\">SSB</span> index , being 
 based on lengths, excludes the problem connected with age estimation.</p><p><span 
 class=\"caps\">SSB</span> index is estimated to have decreased by more than 20% between the 
 periods 2010–2012 (average of the three years) and 2013–2014 (average of the two years).</p>
 <p>A candidate multispecies F<sub><span class=\"caps\">MSY</span></sub> was estimated.</p>"

What am I missing?

mezza
  • Did you try it with the multiple line modifier? (That's `m` if you don't have docs handy.) Ignoring for now HTML + Regex = Three. – Dave Newton Jan 12 '16 at 18:55
  • Dave, you sir are a star. Thank you. – mezza Jan 12 '16 at 19:01
  • When asking a question we need the *minimal* data to demonstrate the problem. You could easily reduce the input to a very short string. You make it harder for us to answer you when you don't do that, plus you make it harder for anyone else to understand when they're looking for a similar solution. – the Tin Man Jan 12 '16 at 23:46
  • When working with HTML or XML, [don't use Regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) unless it's an extremely trivial string. Instead, use a parser; [Nokogiri](http://nokogiri.org) is the de facto standard for Ruby, and makes short work of parsing and modifying HTML/XML. – the Tin Man Jan 12 '16 at 23:48

2 Answers

2

Try this:

input.gsub(/<pre>.*<\/pre>/m,'')

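The m modifier makes . match newline characters too. Without it, . matches any character except a newline, so the pattern cannot span the line breaks between <pre> and </pre>.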
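A minimal sketch of the difference, using a short made-up string rather than the question's full input (the variable names are only for illustration). It also uses the lazy .*? so the match stops at the first closing tag, which is an addition to the pattern above, not part of it:

```ruby
# Hypothetical stand-in for the question's input: a <pre> block spanning a newline.
html = "<p>keep</p><pre><code>one\ntwo</code></pre>\n"

# Without /m, `.` does not match "\n", so the pattern never reaches </pre>
# and gsub returns the string unchanged.
without_m = html.gsub(/<pre>.*<\/pre>/, '')

# With /m, `.` also matches "\n". The lazy `.*?` stops at the first </pre>.
with_m = html.gsub(/<pre>.*?<\/pre>/m, '')

puts without_m == html  # true
puts with_m.inspect     # "<p>keep</p>\n"
```

Note that in Ruby, ^ and $ already match at line boundaries by default; /m only changes what . matches.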

0

It's not clear what you want. Do you want to remove the text from inside the <pre><code> block, or do you want to remove the text and wrapping tags?

This removes the content (text) from inside the block:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<pre><code><p>foo</p></code></pre>
EOT

doc.search('pre code').each do |pc|
  pc.content = ''
end

puts doc.to_html 
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <pre><code></code></pre>
# >> </body></html>

And this removes the content and <code> tags:

doc.search('pre code').each do |pc|
  pc.remove
end

puts doc.to_html 

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <pre></pre>
# >> </body></html>

Alternatively, you can remove the <pre> node itself, which also removes the <code> tags and content inside it:

doc.search('pre').each do |pc|
  pc.remove
end

puts doc.to_html        

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> </body></html>

Except for trivial use-cases where the HTML is very simple, you should rely on a parser. gsub and regular expressions will work until the HTML changes, and then your code explodes or, worse, silently does the wrong thing and returns bad results.
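As a rough illustration of "silently does the wrong thing", here is a sketch using only the standard library. Both input strings are hypothetical and were not in the question; the pattern is the multi-line one from the other answer:

```ruby
# Hypothetical inputs, made up for illustration.
two_blocks = "<pre>a</pre><p>keep</p><pre>b</pre>"
with_attr  = '<pre class="code">a</pre><p>keep</p>'

pattern = /<pre>.*<\/pre>/m

# Greedy `.*` runs from the first <pre> to the LAST </pre>,
# so the paragraph between the two blocks is deleted as well.
greedy_result = two_blocks.gsub(pattern, '')

# The literal "<pre>" does not match a tag that carries an attribute,
# so nothing is removed and no error is raised.
attr_result = with_attr.gsub(pattern, '')

puts greedy_result.inspect     # ""
puts attr_result == with_attr  # true
```

Neither case raises an error, which is why this kind of failure is easy to miss. A parser-based approach such as the Nokogiri searches above selects elements by name and is not affected by attributes or by how many blocks there are.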

the Tin Man