replacing html tag and its content using ruby gsub

Question

I am trying to replace a <p>..</p> tag and its content in an HTML string with an empty string by doing the following.

string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "

When I did

string.gsub!(/<p.*?>|<\/p>/, '')

It just replaced the <p> and </p> tags with an empty string, but the content remained. How can I remove both the tag and its content?

Obligatory: [**Do not parse HTML with regex**](https://stackoverflow.com/a/1732454/1954610). This might work for a "quick and dirty" solution, but the *right* way to do this is with an HTML parser. (e.g. Nokogiri, for ruby.) — Tom Lord, Oct 17 '18 at 11:44
Note that even though Onigmo (Ruby's regexp engine) is IMO more powerful than any other regexp engine except PCRE, and it would be possible to parse XHTML with it, HTML is not as easy: `
foo
quux
bar` is valid HTML where I can't think of a regexp solution that would do the correct thing (erase `
foo` and `
`, and leave `
quux
bar` alone). — Amadan, Oct 17 '18 at 12:39

An Nguyen · Answer 1 · 2018-10-17T11:54:43.483

0

Apparently, your regex does not match <p>...</p> (the tag and its content). Try this:

string.gsub!(/<p>.*<\/p>/, '')

test = '\n <img alt=\"testing artice breaking news\" src=\"something.com" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "'
test.gsub(/<p>.*<\/p>/, '')

Returns

"\\n <img alt=\\\"testing artice breaking news\\\" src=\\\"something.com\" />\\n \\n \""

Also, please consider @Tom Lord's comment: you can use Nokogiri to manipulate HTML.

edited Oct 17 '18 at 11:54

answered Oct 17 '18 at 11:44


`string.gsub!` returned nil. Without the bang, i.e. `string.gsub`, it returned the same string. – user3576036 Oct 17 '18 at 11:48
@user3576036 you can try with the example, or you can edit the exact string that you want to replace – An Nguyen Oct 17 '18 at 11:55
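
The `nil` reported in the comment above is consistent with how Ruby treats `.` and `gsub!`: without the `/m` flag, `.` does not match a newline, and `gsub!` returns `nil` when no substitution is made. A minimal sketch, assuming the real input is a double-quoted string containing actual newline characters (the variable names are illustrative only):

```ruby
# Double-quoted literal, so "\n" is a real newline character here.
html = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"

# Without /m, `.` stops at newlines, so the pattern cannot span the lines
# between <p> and </p> and nothing matches.
single_line = /<p>.*<\/p>/
no_match  = html.dup.gsub!(single_line, '') # gsub! returns nil when nothing was replaced
unchanged = html.gsub(single_line, '')      # gsub returns an unmodified copy instead

# With /m, `.` also matches newlines, so the whole element is removed.
stripped = html.gsub(/<p>.*?<\/p>/m, '')

p no_match          # nil
p unchanged == html # true
p stripped
```

The single-quoted `test` string in the answer behaves differently because `\n` inside single quotes is a literal backslash followed by `n`, not a newline, so `.` can match it.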

score 0 · Answer 2 · answered Sep 19 '21 at 10:26

First of all, consider using an HTML parser when parsing HTML; see "How do I remove a node with Nokogiri?".

If you want to do it with a regex, you can use

string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')

See the Rubular regex demo. This will work with tags that cannot be nested. Details:

<p(?:\s[^>]*)?> - <p, and an optional sequence of a whitespace and zero or more chars other than > (as many as possible), and then >
.*? - due to /m, any zero or more chars as few as possible
<\/p> - a literal </p> string.
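
To illustrate why the lazy `.*?` and the `(?:\s[^>]*)?` group matter, here is a small sketch on a made-up input (the `html` string and the variable names are assumptions for illustration, not taken from the question):

```ruby
# Hypothetical input: two <p> elements (one with an attribute), an <img>
# between them, and a <pre> element whose name merely starts with "p".
html = "<p class=\"lead\">first</p><img src=\"a.png\" /><p>\nsecond\n</p><pre>keep</pre>"

lazy   = /<p(?:\s[^>]*)?>.*?<\/p>/m # stops at the nearest </p>
greedy = /<p(?:\s[^>]*)?>.*<\/p>/m  # runs on to the last </p>

lazy_result   = html.gsub(lazy, '')
greedy_result = html.gsub(greedy, '')

p lazy_result   # "<img src=\"a.png\" /><pre>keep</pre>"
p greedy_result # "<pre>keep</pre>"
```

The lazy version removes each paragraph separately and leaves the `<img>` alone, while the greedy one swallows everything between the first `<p ...>` and the last `</p>`. In both cases `<pre>` survives, because after `<p` the pattern requires either whitespace or `>`.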

If the tags can be nested, you still can use a regex:

tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"

See the Rubular regex demo. Details:

<#{tagname} - < and tag name
(?:\s[^>]*)?> - an optional sequence of a whitespace and then zero or more chars other than >, and then >
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)* - zero or more occurrences of
- [^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)* - zero or more chars other than <, and then zero or more sequences of a < that is not followed with tag name + whitespace or >, nor with / + tag name + >, followed with zero or more chars other than <
- |
- \g<0> - the whole regex pattern recursed
<\/#{tagname}> - </ + tag name + >.
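
A sketch of how the recursive pattern differs from the flat one on nested elements. The helper name `strip_element` and the sample markup are hypothetical, and the sketch assumes `tagname` contains only word characters; the regex itself is the one shown above:

```ruby
# Hypothetical helper: removes every <tagname>...</tagname> element,
# including nested elements of the same name, via regex recursion (\g<0>).
def strip_element(html, tagname)
  rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
  html.gsub(rx, '')
end

# Made-up input with a <div> nested inside another <div>.
html = "<div class=\"outer\">a<div>b</div>c</div><p>keep</p>"

# The flat, lazy pattern stops at the first </div> and leaves debris behind.
flat = html.gsub(/<div(?:\s[^>]*)?>.*?<\/div>/m, '')

# The recursive pattern consumes the balanced outer element as a whole.
nested = strip_element(html, "div")

p flat   # "c</div><p>keep</p>"
p nested # "<p>keep</p>"
```

This only balances elements that are explicitly closed; HTML with omitted end tags, as raised in the comments on the question, is still better handled by a parser.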

See a Ruby demo:

string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')

tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
