1

I am trying to get rid of all the extra <br> in the first paragraph and last paragraph.

For example:

st = "<p><br><br><br><br>apple</p>
     <p>bananas</p>
     <p>orange<br><br><br><br><br></p>
     <p>tomatoes</p>
     <p>berry<br><br><br><br><br><br></p>"

I'm hoping to end up with this:

        "<p>apple</p>
         <p>bananas</p>
         <p>orange<br><br><br><br><br></p>
         <p>tomatoes</p>
         <p>berry</p>"

My goal is to leave the <br> tags in the middle paragraphs (e.g. the orange paragraph) alone, and to remove all the <br> tags at the start of the first paragraph and at the end of the last paragraph.

I've tried doing this regex:

st.sub(/^((<p>)|<br( \/)?>)*|(<p>|<br( \/)?>|< \/p>)*$/, '')

I get this:

=>  "<p>apple</p>
     <p>bananas</p>
     <p>orange<br><br><br><br><br></p>
     <p>tomatoes</p>
     <p>berry<br><br><br><br><br><br></p>"

I am unable to delete the repeated <br> tags in the last paragraph.

the Tin Man
myhouse
  • 4
    Rule one when working with HTML or XML, is to use a parser, not regular expressions. While there are trivial cases where it's possible, doing so is fragile and likely to break if the markup changes. See http://stackoverflow.com/q/1732348/128421. – the Tin Man May 10 '17 at 19:06

3 Answers

4

Don't use regular expressions. Instead use a parser:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT

p_tags = doc.search('p')
[:first, :last].each { |s| p_tags.send(s).search('br').remove }
doc.to_html

Which would result in the fragment looking like:

# => "<p>apple</p>\n" +
#    "<p>bananas</p>\n" +
#    "<p>orange<br><br><br><br><br></p>\n" +
#    "<p>tomatoes</p>\n" +
#    "<p>berry</p>\n"

Parsers cope with changing HTML far better than patterns do, so if you're going to modify or scrape HTML, it pays off to learn how to use one.

An alternate way to do what you want without a parser or a complicated regex is:

str = <<EOT
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT

str_lines = str.lines
[0, -1].each { |i| str_lines[i].gsub!(/<br>/, '') }
puts str_lines.join

Which results in the same thing.

The strength of the first method is that it won't care if the <br> tags mysteriously change to <br/> or <br />, as in XHTML-style markup, or to <br >.
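As a rough illustration of that fragility, here is a sketch of the line-based approach run against differently spelled tags. The input is made up for the demonstration, and `strip_edge_brs` and `BR_VARIANTS` are hypothetical names, not part of any library. It assumes one paragraph per line, as in the sample data.

```ruby
# Hypothetical input: the <br> tags are spelled four different ways.
html = <<~HTML
  <p><br/><br />apple</p>
  <p>bananas</p>
  <p>berry<br ><BR></p>
HTML

# A more tolerant pattern: optional whitespace, optional slash, any case.
BR_VARIANTS = /<br\s*\/?>/i

# Remove everything matching `pattern` from the first and last lines only.
def strip_edge_brs(text, pattern)
  lines = text.lines
  [0, -1].each { |i| lines[i] = lines[i].gsub(pattern, '') }
  lines.join
end

literal  = strip_edge_brs(html, /<br>/)       # the literal pattern misses every variant
tolerant = strip_edge_brs(html, BR_VARIANTS)  # the tolerant pattern catches these four

puts literal == html
puts tolerant
```

Even the tolerant pattern only covers the spellings it was written for, and it still breaks as soon as a paragraph spans more than one line, which is the kind of change a parser absorbs without any edits.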

Finally, if you doubly insist on using a longer, more complicated, pattern, at least simplify it:

puts str.sub(/\A<p>(?:<br>)+/, '<p>').sub(/(?:<br>)+<\/p>\Z/, '</p>')

which results in the same thing again.

Regular expressions are great for some tasks, but they're not good for markup. If you insist on using a regular expression, then simplify the problem, as the later solutions do, because that reduces the complexity of the pattern, which improves readability and eases maintenance.

the Tin Man
2
st = st.gsub(/(?<=\A<p>)(<br\/?>)+|(<br\/?>)+(?=[<]\/p>\Z)/, '')

There are two parts separated by a pipe (OR):

1) (?<=\A<p>)(<br\/?>)+ matches one or more <br> that come right after a <p> tag at the start of the string (\A)

2) (<br\/?>)+(?=[<]\/p>\Z) matches one or more <br> that come right before a </p> closing tag at the end of the string (\Z)

And gsub because we want to replace all occurrences in the string, not just the first.
The g in gsub stands for global.
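To make the sub/gsub difference concrete, here is a small sketch that uses a shortened version of the sample string (the shortened data is my own assumption) and the same pattern, written with %r{} so the slashes need no escaping and with non-capturing groups.

```ruby
# Shortened sample: leading <br> run in the first paragraph,
# trailing <br> runs in the middle and last paragraphs.
st = "<p><br><br>apple</p>\n<p>orange<br><br></p>\n<p>berry<br><br><br></p>"

# Same two alternatives as above: a <br> run after the opening <p> at the
# start of the string, or a <br> run before the closing </p> at its end.
pattern = %r{(?<=\A<p>)(?:<br/?>)+|(?:<br/?>)+(?=</p>\Z)}

only_first = st.sub(pattern, '')   # replaces the first match only
both_ends  = st.gsub(pattern, '')  # replaces every match

puts only_first
puts both_ends
```

With sub, only the leading run is removed and the last paragraph keeps its <br> tags; with gsub, both ends are cleaned while the middle paragraph is left untouched because its </p> is not at the end of the string.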

LukStorms
1

I suggest something simple that's easy to understand, test and maintain.

str =<<-_
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
_
  #=> "<p><br><br><br><br>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry<br><br><br><br><br><br></p>\n"

first, *mid, last = str.lines
s = first.gsub('<br>', '') << mid.join << last.gsub('<br>', '')
  #=> "<p>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry</p>\n"
puts s
<p>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry</p>

Note that

first
  #=> "<p><br><br><br><br>apple</p>\n" 
mid
  #=> ["<p>bananas</p>\n",
  #    "<p>orange<br><br><br><br><br></p>\n",
  #    "<p>tomatoes</p>\n"]
last
  #=> "<p>berry<br><br><br><br><br><br></p>\n" 
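One thing to watch with the splat assignment is that `last` is nil when the string has only one line. A minimal sketch of a wrapper that guards against that follows; the method name `strip_br_in_outer_lines` is hypothetical, and it keeps the same assumption of one paragraph per line.

```ruby
# Remove every literal <br> from the first and last lines of `html`,
# leaving all lines in between untouched.
def strip_br_in_outer_lines(html)
  first, *mid, last = html.lines
  return '' if first.nil?                    # empty input has no lines at all

  cleaned = first.gsub('<br>', '') + mid.join
  cleaned += last.gsub('<br>', '') if last   # last is nil for one-line input
  cleaned
end

one_line = strip_br_in_outer_lines("<p><br>apple<br></p>\n")
three    = strip_br_in_outer_lines("<p><br>apple</p>\n<p>orange<br></p>\n<p>berry<br></p>\n")

puts one_line
puts three
```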
Cary Swoveland