First of all, consider using an HTML parser when parsing HTML; see "How do I remove a node with Nokogiri?".
If you want to do it with a regex, you can use
string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
See the Rubular regex demo. This only works for tags that cannot be nested. Details:
- `<p(?:\s[^>]*)?>` - `<p`, an optional sequence of a whitespace char and zero or more chars other than `>` (as many as possible), and then `>`
- `.*?` - due to `/m`, any zero or more chars, as few as possible
- `<\/p>` - a `</p>` string.
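To make the "cannot be nested" caveat concrete, here is a minimal sketch that wraps the pattern above in a helper and runs it on three inputs. The constant name, the method name and the sample strings are made up for illustration and are not part of the original question.

```ruby
# The non-nested pattern from above, stored in a constant (name is illustrative).
SIMPLE_P_RX = /<p(?:\s[^>]*)?>.*?<\/p>/m

# Hypothetical helper: removes every flat <p>...</p> element from +html+.
def strip_flat_paragraphs(html)
  html.gsub(SIMPLE_P_RX, '')
end

# Attributes and line breaks inside the paragraph are handled.
p strip_flat_paragraphs("<div><p class=\"a\">one\ntwo</p><span>keep</span></div>")
# => "<div><span>keep</span></div>"

# <pre> is not mistaken for <p>, because <p must be followed by whitespace or >.
p strip_flat_paragraphs("<pre>code</pre>")
# => "<pre>code</pre>"

# Nested paragraphs: the lazy .*? stops at the first </p>, leaving a stray tail.
p strip_flat_paragraphs("<p>outer <p>inner</p> tail</p>")
# => " tail</p>"
```

The last call is the failure mode that the recursive pattern is meant to address.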
If the tags can be nested, you can still use a regex:
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
See the Rubular regex demo. Details:
- `<#{tagname}` - `<` and the tag name
- `(?:\s[^>]*)?>` - an optional sequence of a whitespace char and then zero or more chars other than `>`, and then `>`
- `(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*` - zero or more occurrences of
  - `[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*` - zero or more chars other than `<`, and then zero or more sequences of a `<` that is not followed with the tag name + `>` or whitespace, or with `/` + tag name + `>`, followed with zero or more chars other than `<`
  - `|` - or
  - `\g<0>` - the whole regex pattern recursed
- `<\/#{tagname}>` - `</` + tag name + `>`.
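As a rough sketch of how the recursive pattern could be packaged for any tag name, the helper below builds the regex per call. The method name `strip_nested_tag` and the sample inputs are hypothetical, and `Regexp.escape` is an added precaution that the pattern above does not include.

```ruby
# Hypothetical helper: removes <tagname>...</tagname> elements, including
# properly nested ones, using the recursive pattern described above.
def strip_nested_tag(html, tagname)
  t = Regexp.escape(tagname) # precaution in case the name has regex metachars
  rx = /<#{t}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{t}[\s>]|\/#{t}>)[^<]*)*|\g<0>)*<\/#{t}>/
  html.gsub(rx, '')
end

# The outer <p> and the <p> nested inside it are removed as a single match.
p strip_nested_tag("<p>outer <p>inner</p> tail</p><b>keep</b>", "p")
# => "<b>keep</b>"

# The same helper works for another tag name, with attributes on the open tag.
p strip_nested_tag("<div id=\"a\"><div>x</div></div><p>stay</p>", "div")
# => "<p>stay</p>"
```

Building the regex inside the method keeps the tag name a parameter; if the same tag is stripped many times, the compiled regex could be cached instead.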
See a Ruby demo:
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
`<p>foo<ul><li>quux</li></ul>bar</p>` is valid HTML where I can't think of a regexp solution that would do the correct thing (erase `<p>foo` and `</p>`, and leave `<ul><li>quux</li></ul>bar` alone). – Amadan Oct 17 '18 at 12:39
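The limitation raised in the comment can be reproduced with a small sketch. The input below is an assumed example, not taken from the question: in HTML a `<ul>` start tag implicitly closes an open `<p>`, so the first paragraph contains only `foo`. The method name is hypothetical.

```ruby
# Hypothetical helper using the simple, non-nested pattern from the answer.
def strip_p_with_regex(html)
  html.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
end

# Assumed input: the first <p> has no explicit end tag; <ul> closes it.
# An HTML parser would see the paragraphs "foo" and "baz", with the list
# and the text "bar" sitting between them, outside any paragraph.
html = "<p>foo<ul><li>quux</li></ul>bar<p>baz</p>"

# The regex pairs the first <p> with the only </p> and removes everything,
# including the list and "bar" that should have been kept.
p strip_p_with_regex(html)
# => ""
```

This is the kind of input where an HTML parser is the safer choice, since the regex has no knowledge of implied end tags.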