How to remove all the <br> from the first paragraph and last using regex

Question

I am trying to get rid of all the extra <br> tags in the first paragraph and the last paragraph.

For example:

st = "<p><br><br><br><br>apple</p>
     <p>bananas</p>
     <p>orange<br><br><br><br><br></p>
     <p>tomatoes</p>
     <p>berry<br><br><br><br><br><br></p>"

I'm hoping to end up with this:

        "<p>apple</p>
         <p>bananas</p>
         <p>orange<br><br><br><br><br></p>
         <p>tomatoes</p>
         <p>berry</p>"

My goal is to leave the <br> tags in the middle paragraphs (e.g. the orange paragraph) alone, and to remove all the <br> tags from the first paragraph and from the end of the last paragraph.

I've tried doing this regex:

st.sub(/^((<p>)|<br( \/)?>)*|(<p>|<br( \/)?>|< \/p>)*$/, '')

I get this:

=>  "<p>apple</p>
     <p>bananas</p>
     <p>orange<br><br><br><br><br></p>
     <p>tomatoes</p>
     <p>berry<br><br><br><br><br><br></p>"

I am unable to delete the repeating <br> tags in the last paragraph.

Rule one when working with HTML or XML, is to use a parser, not regular expressions. While there are trivial cases where it's possible, doing so is fragile and likely to break if the markup changes. See http://stackoverflow.com/q/1732348/128421. — the Tin Man, May 10 '17 at 19:06

the Tin Man · Accepted Answer · 2017-05-10T23:10:45.037

Don't use regular expressions. Instead use a parser:

require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT

p_tags = doc.search('p')
[:first, :last].each { |s| p_tags.send(s).search('br').remove }
doc.to_html

Which would result in the fragment looking like:

# => "<p>apple</p>\n" +
#    "<p>bananas</p>\n" +
#    "<p>orange<br><br><br><br><br></p>\n" +
#    "<p>tomatoes</p>\n" +
#    "<p>berry</p>\n"

Parsers cope much better with changing HTML, so if you're going to do any HTML changes or scraping, it pays off to learn how to use them.

An alternate way to do what you want without a parser or a complicated regex is:

str = <<EOT
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
EOT

str_lines = str.lines
[0, -1].each { |i| str_lines[i].gsub!(/<br>/, '') }
puts str_lines.join

Which results in the same thing.

The strength of the first method is that it won't care if the <br> tags mysteriously change to another spelling, such as <br/> or <br />.
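To make that point concrete, here is a small sketch that uses only the standard library. The sample string and the `br_any` pattern are assumptions made for illustration and are not part of the original answer. It shows that a literal `<br>` replacement misses the self-closing spellings, while a widened pattern catches them. It is still line-based, so it stays sensitive to how the markup is laid out in a way the parser is not.

```ruby
# Hypothetical input: same shape as the question, but with mixed <br> spellings.
html = <<~EOT
  <p><br/><br />apple</p>
  <p>orange<br><br/></p>
  <p>berry<br /><br></p>
EOT

# Illustrative pattern (an assumption): matches <br>, <br/> and <br />.
br_any = %r{<br\s*/?>}

# The literal replacement only removes the exact string "<br>".
literal = html.lines
[0, -1].each { |i| literal[i] = literal[i].gsub('<br>', '') }

# The widened pattern removes every spelling from the first and last lines.
widened = html.lines
[0, -1].each { |i| widened[i] = widened[i].gsub(br_any, '') }

puts literal.join
puts widened.join
```

In the `literal` result the first line is untouched and the last line still contains `<br />`; in the `widened` result both outer paragraphs are clean and the middle paragraph is left alone.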

Finally, if you doubly insist on using a longer, more complicated, pattern, at least simplify it:

puts str.sub(/\A<p>(?:<br>)+/, '<p>').sub(/(?:<br>)+<\/p>\Z/, '</p>')

which results in the same thing again.

Regular expressions are great for some tasks, but they're not good for markup. If you insist on using a regular expression, then simplify the problem as in the later solutions because it reduces the complexity of the pattern, which improves readability and eases maintenance.

LukStorms · Answer 2 · 2017-05-11T08:31:33.633

st = st.gsub(/(?<=\A<p>)(<br\/?>)+|(<br\/?>)+(?=[<]\/p>\Z)/, '')

There are 2 parts separated by a pipe (OR):

1) (?<=\A<p>)(<br\/?>)+ matches 1 or more <br> that come right after a <p> tag at the start of the string (\A)

2) (<br\/?>)+(?=[<]\/p>\Z) matches 1 or more <br> that come right before a </p> closing tag at the end of the string (\Z)

And gsub because we want to replace all occurrences in the string, not just the first.
The g in gsub stands for global.
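
The difference between sub and gsub matters here because the two alternatives match at opposite ends of the string. A minimal sketch, reusing the pattern above on a shortened sample string (the sample and the variable names are assumptions for illustration):

```ruby
st = "<p><br><br>apple</p>\n<p>orange<br></p>\n<p>berry<br><br></p>"

pattern = /(?<=\A<p>)(<br\/?>)+|(<br\/?>)+(?=[<]\/p>\Z)/

# sub stops after the first match, so only the leading run is removed.
first_only = st.sub(pattern, '')

# gsub keeps scanning and also removes the trailing run.
both_ends = st.gsub(pattern, '')

puts first_only
puts both_ends
```

With sub, the last paragraph keeps its <br> tags, which is the same symptom described in the question. With gsub, both ends are cleaned and the orange paragraph is untouched.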

Cary Swoveland · Answer 3 · 2017-05-11T02:23:45.303

I suggest something simple that's easy to understand, test and maintain.

str =<<-_
<p><br><br><br><br>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry<br><br><br><br><br><br></p>
_
  #=> "<p><br><br><br><br>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry<br><br><br><br><br><br></p>\n"

first, *mid, last = str.lines    
s = first.gsub('<br>', '') << mid.join << last.gsub('<br>', '')
  #=> "<p>apple</p>\n<p>bananas</p>\n<p>orange<br><br><br><br><br></p>\n<p>tomatoes</p>\n<p>berry</p>\n" 
puts s
<p>apple</p>
<p>bananas</p>
<p>orange<br><br><br><br><br></p>
<p>tomatoes</p>
<p>berry</p>

Note that

first
  #=> "<p><br><br><br><br>apple</p>\n" 
mid
  #=> ["<p>bananas</p>\n",
  #    "<p>orange<br><br><br><br><br></p>\n",
  #    "<p>tomatoes</p>\n"]
last
  #=> "<p>berry<br><br><br><br><br><br></p>\n"
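
One caveat with this destructuring: if the string has only one line, `last` is nil and `last.gsub` would raise. A hedged sketch of a small wrapper that guards that case (the method name `strip_outer_brs` is hypothetical and not part of the answer above):

```ruby
# Hypothetical helper: removes <br> from the first and last lines only.
def strip_outer_brs(str)
  lines = str.lines
  return str if lines.empty?

  first, *mid, last = lines
  result = first.gsub('<br>', '') + mid.join
  # last is nil when the input has a single line; first already covers it.
  result += last.gsub('<br>', '') if last
  result
end

puts strip_outer_brs("<p><br>apple</p>\n<p>orange<br></p>\n<p>berry<br></p>\n")
```

For a single-line input the one paragraph counts as both first and last, so all of its <br> tags are removed.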

I noticed that this is very similar to the second part of @theTinMan's (earlier) answer. — Cary Swoveland, May 10 '17 at 23:37
