
I am trying to replace a <p>..</p> tag and its content in an HTML string with an empty string. The string looks like this:

string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "

When I ran

string.gsub!(/<p.*?>|<\/p>/, '')

it replaced only the <p> and </p> tags with an empty string, but the content remained. How can I remove both the tag and its content?

Quv
user3576036
    Obligatory: [**Do not parse HTML with regex**](https://stackoverflow.com/a/1732454/1954610). This might work for a "quick and dirty" solution, but the *right* way to do this is with an HTML parser. (e.g. Nokogiri, for ruby.) – Tom Lord Oct 17 '18 at 11:44
  • Note that even though Onigmo (Ruby's regexp engine) is IMO more powerful than any other regexp engine except PCRE, and it would be possible to parse XHTML with it, HTML is not as easy: `<p>foo<ul><li>quux</li></ul>bar</p>` is valid HTML where I can't think of a regexp solution that would do the correct thing (erase `<p>foo` and `</p>`, and leave `<ul><li>quux</li></ul>bar` alone). – Amadan Oct 17 '18 at 12:39

2 Answers


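Your regex matches only the opening and closing tags, not the content between them. To match <p> together with its content, try this:

string.gsub!(/<p>.*?<\/p>/m, '')

The /m flag lets . match newlines, and the lazy .*? stops at the nearest </p>. For example:

test = '\n <img alt=\"testing artice breaking news\" src=\"something.com" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "'
test.gsub(/<p>.*?<\/p>/m, '')

Returns:

"\\n <img alt=\\\"testing artice breaking news\\\" src=\\\"something.com\" />\\n \\n \""

(The test string is single-quoted, so each \n stays a literal backslash followed by n, which is why the backslashes appear doubled in the output.)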

Also, please consider @Tom Lord's comment: you can use Nokogiri to manipulate HTML.
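A minimal sketch of how `.` and greediness behave once the string contains real newlines. The sample string and the constant names below are made up for illustration and are not part of the question:

```ruby
# Double-quoted, so \n and \t are real newline and tab characters.
SAMPLE = "\n <img src=\"something.com\" />\n <p>\n \tfirst\n </p>\n <b>keep</b>\n <p>second</p>\n"

SINGLE_LINE = /<p>.*<\/p>/    # no /m: `.` stops at a newline
GREEDY      = /<p>.*<\/p>/m   # /m, greedy: runs to the last </p>
LAZY        = /<p>.*?<\/p>/m  # /m, lazy: stops at the nearest </p>

# Only the paragraph that sits on one line is removed.
p SAMPLE.gsub(SINGLE_LINE, '')
# => "\n <img src=\"something.com\" />\n <p>\n \tfirst\n </p>\n <b>keep</b>\n \n"

# Everything from the first <p> to the last </p> is removed, including <b>keep</b>.
p SAMPLE.gsub(GREEDY, '')
# => "\n <img src=\"something.com\" />\n \n"

# Each paragraph is removed on its own; the text between them survives.
p SAMPLE.gsub(LAZY, '')
# => "\n <img src=\"something.com\" />\n \n <b>keep</b>\n \n"
```

In Ruby, /m is the flag that makes `.` match a newline, so a pattern meant to span a multi-line paragraph generally needs both /m and a lazy quantifier.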

An Nguyen

First of all, consider using an HTML parser when working with HTML; see "How do I remove a node with Nokogiri?".

If you want to do it with a regex, you can use

string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')

See the Rubular regex demo. This will work with tags that cannot be nested. Details:

  • <p(?:\s[^>]*)?> - <p, and an optional sequence of a whitespace and zero or more chars other than > (as many as possible), and then >
  • .*? - due to /m, any zero or more chars as few as possible
  • <\/p> - </p> string.
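To illustrate the breakdown above, here is a small sketch showing that the pattern accepts attributes on <p> but does not fire on other tags that merely start with "p". The `strip_paragraphs` helper, the `P_TAG` constant, and the sample markup are hypothetical names introduced for this example:

```ruby
# The pattern from the answer, stored in a constant for reuse.
P_TAG = /<p(?:\s[^>]*)?>.*?<\/p>/m

# Hypothetical helper: returns a copy of html with every non-nested <p>...</p> removed.
def strip_paragraphs(html)
  html.gsub(P_TAG, '')
end

sample = "<p class=\"lead\">intro</p><pre>code</pre><p>\nbody\n</p><param name=\"x\">"

# <pre> and <param> survive because <p must be followed by whitespace or >.
p strip_paragraphs(sample)
# => "<pre>code</pre><param name=\"x\">"
```

Requiring whitespace or > right after <p is what keeps <pre> and <param> out of the match; a plain /<p.*?>/ would catch them too.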

If the tags can be nested, you still can use a regex:

tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"

See the Rubular regex demo. Details:

  • <#{tagname} - < and tag name
  • (?:\s[^>]*)?> - an optional sequence of a whitespace and then zero or more chars other than >, and then >
  • (?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)* - zero or more occurrences of
    • [^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)* - zero or more chars other than <, and then zero or more sequences of a < that is not followed with tag name + > or whitespace, or with / + tag name + >, followed with zero or more chars other than <
    • | - or
    • \g<0> - the whole regex pattern recursed
  • <\/#{tagname}> - </ + tag name + >.
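The difference between the lazy and the recursive pattern only shows up on nested input, so here is a hedged sketch using <div>, which can legally nest. The `nested_tag_regex` helper and the sample string are assumptions made for this illustration; the pattern itself is the one described above, with `Regexp.escape` added as a precaution for tag names:

```ruby
# Hypothetical helper (not from the answer): builds the recursive pattern for one tag name.
def nested_tag_regex(tagname)
  t = Regexp.escape(tagname)
  /<#{t}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{t}[\s>]|\/#{t}>)[^<]*)*|\g<0>)*<\/#{t}>/
end

html = "<div>a<div class=\"inner\">b</div>c</div>d"

# Lazy matching stops at the first closing tag and leaves the outer tail behind.
p html.gsub(/<div(?:\s[^>]*)?>.*?<\/div>/m, '')
# => "c</div>d"

# The recursive pattern consumes the balanced inner element before closing the outer one.
p html.gsub(nested_tag_regex("div"), '')
# => "d"
```

This still assumes well-formed, balanced tags; for markup that is not balanced, an HTML parser remains the safer choice.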

See a Ruby demo:

string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')

tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
Wiktor Stribiżew