I need help writing a regular expression that parses a string of HTML and replaces encoded quotation marks inside the style attribute. The same encoded quote also appears in the content of my HTML string (outside any style attribute), and those occurrences should NOT be replaced. Here's my failed RegEx:

/style=".*(&quot;)*.*"/ig

Obviously, this is wrong because I have very little skill when it comes to RegEx. For example, here is what I am trying to replace:

<p style="font-family:&quot;Times New Roman&quot; color: red; background:url(&quot;whatever&quot;);">test1</p><p style="font-family:&quot;Times New Roman&quot; color: blue;">THIS IS CONTENT &quot;DO NOT REPLACE!&quot;</p><p style="font-family:&quot;Times New Roman&quot; color: green;">test</p><p style="font-family:&quot;Times New Roman&quot; color: orange;">test2</p>

My desired output:

<p style="font-family:'Times New Roman' color: red; background:url('whatever');">test1</p><p style="font-family:'Times New Roman' color: blue;">THIS IS CONTENT &quot;DO NOT REPLACE!&quot;</p><p style="font-family:'Times New Roman' color: green;">test</p><p style="font-family:'Times New Roman' color: orange;">test2</p>

All instances of &quot; inside style="…" should be replaced, but not the ones in the content areas of HTML tags. Any help here is greatly appreciated!

Enlico
  • Obligatory post about the [futility of parsing X/HTML with regular expressions](https://stackoverflow.com/a/1732454/62576) – Ken White Feb 21 '20 at 19:29
  • What language/tool are you using? From the [regex tag info](https://stackoverflow.com/tags/regex/info): "Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool." – Toto Feb 22 '20 at 13:37
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. – Toto Feb 22 '20 at 13:37

2 Answers

There are several problems with your regexp /style=".*(&quot;)*.*"/ig:

  • The dot (.) matches any character except a newline, and * is greedy, so .* does not stop at the closing quote of the style attribute. It runs on to the last double quote " on the line.

  • Because (&quot;) is followed by *, it is optional, so the pattern matches any style="...", even one with no &quot; in it.
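
The first problem is easy to see on a small input. The snippet below is only an illustrative sketch: the `html` string is a shortened, made-up version of the question's markup, and `greedy_match` and `bounded_match` are names chosen here, not part of the question.

```ruby
# Shortened, made-up input: two paragraphs on one line, like the question's HTML.
html = '<p style="a:&quot;x&quot;">one &quot;keep&quot;</p><p style="b:1">two</p>'

# The pattern from the question. String#[] returns the first match as a string.
greedy_match = html[/style=".*(&quot;)*.*"/i]

# For comparison: a negated character class cannot run past the attribute's
# own closing quote, because &quot; contains no literal " character.
bounded_match = html[/style="[^"]*"/i]

puts greedy_match   # style="a:&quot;x&quot;">one &quot;keep&quot;</p><p style="b:1"
puts bounded_match  # style="a:&quot;x&quot;"
```

The greedy match swallows the content of the first paragraph, including the &quot; that must be kept, and ends inside the second tag.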

To overcome this, I think you need to spell out which characters are allowed inside the style value alongside &quot;, and allow that combination to repeat any number of times within the attribute.

A regexp like this will work:

regexp = /style="(([a-z0-9:-]|;|\s|\(|\))*(&quot;)([a-z0-9:-]|;|\s|\(|\))*)*"/i

A more compact version, suggested by Toto in the comments:

regexp = /style="([a-z0-9:;\s()-]*(&quot;)[a-z0-9:;\s()-]*)*"/i

Here is a program I wrote in Ruby to test it:

st = %q(
  <p style="font-family:&quot;Times New Roman&quot; color: red; background:url(&quot;whatever&quot;);">test1</p>
  <p style="font-family:&quot;Times New Roman&quot; color: blue;">THIS IS CONTENT &quot;DO NOT REPLACE!&quot;</p>
  <p style="font-family:&quot;Times New Roman&quot; color: green;">test</p>
  <p style="font-family:&quot;Times New Roman&quot; color: orange;">test2</p>
  )

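def replace_quotes_in_styles(st)
  # Same pattern as above, except the outer group uses + instead of *:
  # with *, an empty style="" also matches and the loop below never ends.
  regexp = /style="(([a-z0-9:-]|;|\s|\(|\))*(&quot;)([a-z0-9:-]|;|\s|\(|\))*)+"/i

  # Rewrite one matching attribute per pass. A rewritten attribute contains '
  # (not in the character class), so it cannot match again.
  while (match_data = st.match(regexp))
    st = st.sub(match_data.to_s, match_data.to_s.gsub("&quot;", "'"))
  end

  st
end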

puts replace_quotes_in_styles(st)

It prints output like this:

<p style="font-family:'Times New Roman' color: red; background:url('whatever');">test1</p>
<p style="font-family:'Times New Roman' color: blue;">THIS IS CONTENT &quot;DO NOT REPLACE!&quot;</p>
<p style="font-family:'Times New Roman' color: green;">test</p>
<p style="font-family:'Times New Roman' color: orange;">test2</p>

Or, as a more concise program:

st = %q(
  <p style="font-family:&quot;Times New Roman&quot; color: red; background:url(&quot;whatever&quot;);">test1</p>
  <p style="font-family:&quot;Times New Roman&quot; color: blue;">THIS IS CONTENT &quot;DO NOT REPLACE!&quot;</p>
  <p style="font-family:&quot;Times New Roman&quot; color: green;">test</p>
  <p style="font-family:&quot;Times New Roman&quot; color: orange;">test2</p>
  )

def replace_quotes_in_styles(st)
  regexp = /style="([a-z0-9:;\s()-]*(&quot;)[a-z0-9:;\s()-]*)*"/i
  st.gsub(regexp) { |s| s.gsub("&quot;", "'") }
end

puts replace_quotes_in_styles(st)
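
One limitation worth noting: the character class [a-z0-9:;\s()-] does not include characters such as # or a comma, so a style value like color:#fff is not matched at all. A possible variation, sketched below, matches the whole attribute value with [^"]* and then replaces inside the match. The helper name `unquote_style_attributes` and the `sample` string are made up for this illustration, and the sketch assumes double-quoted style attributes only. Like any regex approach, it would also touch text that merely looks like style="..." elsewhere in the document.

```ruby
# Hypothetical helper, not from the answer above: match the whole
# double-quoted style attribute, then replace &quot; only inside that match.
def unquote_style_attributes(html)
  # [^"]* stops at the attribute's closing quote, because the encoded
  # quote &quot; contains no literal " character.
  html.gsub(/style="[^"]*"/i) { |attr| attr.gsub("&quot;", "'") }
end

# Made-up sample with characters outside the whitelist (comma, #).
sample = '<p style="font-family:&quot;Arial&quot;, sans-serif; color:#fff">say &quot;hi&quot;</p>'
puts unquote_style_attributes(sample)
# <p style="font-family:'Arial', sans-serif; color:#fff">say &quot;hi&quot;</p>
```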
Châu Hồng Lĩnh

What about the following PCRE?

/(?>style=")([^"]*?)&quot;(.*?)&quot;/g

The substitution string then has to be \1'\2'. Check it here.

Enlico
  • Checked it ... there's another part inside the style that isn't being substituted: background:url("whatever") ... can you take a look at that part? I should have specified my need as ECMAScript but I didn't realize there was different syntax between the two. – Scott Simmons Feb 21 '20 at 19:21
  • Add it in your question, however, so other people can answer, because I do not know how to solve this problem (so I'll likely delete the answer as soon as someone posts a true answer; in the meanwhile it can be a starting point for you to try things out). Besides, you specified _but not the ones in the content areas of HTML tags_; please, give an example for this case too (in the question, not in the comments). – Enlico Feb 21 '20 at 19:34
  • It is provided in the example code you supplied, check the first line ... there's a "font-family", then "color", then "background" ... that CSS element contains the encoded quote as well. This is included in my original example and definitely not within the "content area" as mentioned. Thanks! – Scott Simmons Feb 21 '20 at 19:36
  • Apologies, I was looking at the original post, where the code formatting was a bit messed up. – Enlico Feb 21 '20 at 19:37