2

I'm writing a web application in ASP.NET and need help with regular expressions. I need two expressions: the first should find every double quote character inside an HTML tag so that I can replace it with a single quote, and the second should find every double quote that is not part of an HTML tag so that I can replace it with &quot;.

For example:

<p>This is a "wonderful long text". "Another wonderful ong text"</p> At least it should be. Here we have a <a href="http://wwww.site-to-nowhere.com" target="_blank">link</a>

Should be changed like so.

<p>This is a &quot;wonderful long text&quot;. &quot;Another wonderful ong text&quot;</p> At least it should be. Here we have a <a href='http://wwww.site-to-nowhere.com' target='_blank'>link</a>

I have tried the following expression:

"([^<>]*?)"(?=[^>]+?<)

But the problem is that it does not catch "Another wonderful ong text", probably because it's right next to the </p> tag.

Can you help me with this problem? Or are there other ways to do this replacement in .NET?

Regular Jo
Roman Suska
  • Why do you want to do this? – शेखर Feb 13 '15 at 12:14
  • Not a trivial task to do reliably. For example, you'll need to handle tags such as: `<p title="Can't put this in single quotes!">..</p>`. (Notice the single quote inside the double quoted attribute value.) – ridgerunner Feb 13 '15 at 18:16
  • I need to do this because we're using a third-party rich text editor control. It's created through JavaScript, and there are problems with feeding it HTML from the model, for example in the edit action: the JavaScript uses double quotes, so every double quote in the text is treated as the end of the string. That's why I need to replace them. ridgerunner, that's not a problem, because I don't expect such a case in the rich text editor. – Roman Suska Feb 17 '15 at 08:11

4 Answers

3

Don't use regex to parse HTML. I can recommend HtmlAgilityPack:

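var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);  // html is your HTML string
// SelectNodes returns null when nothing matches, so guard before iterating
var textNodes = doc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlAgilityPack.HtmlTextNode node in textNodes)
    {
        // Only text nodes are touched; attribute values keep their double quotes
        node.Text = node.Text.Replace("\"", "&quot;");
    }
}
var sw = new System.IO.StringWriter();
doc.Save(sw);
string result = sw.ToString();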

I've tested it with your sample HTML; this is the (desired) result:

<p>This is a &quot;wonderful long text&quot;. &quot;Another wonderful ong text&quot;</p> At least it should be. Here we have a <a href="http://wwww.site-to-nowhere.com" target="_blank">link</a>
Tim Schmelter
  • I know that's the go-to answer, but in this very simple case, there's nothing wrong with using regex. It's no different than if a user wanted to find commas outside of parentheses. You can hardly call the OP's task parsing. – Regular Jo Feb 13 '15 at 18:09
1

I would do this

Find: "(?=[^<]*>)
Replace: '

Find: "(?=[^>]*<)
Replace: &quot;

Is it even necessary to use the first regex, though? The second should do the job fine and leave double-quoted tag attributes alone. As smimov says, once the quotes on one side are replaced, you can just do a generic replace for the rest. I only provide two regexes because you may find the first unnecessary.
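Since the question is about .NET, here is a minimal sketch of how the two Find/Replace pairs above might be applied with `Regex.Replace`. The class and method names (`QuoteRewriter`, `Rewrite`) are made up for illustration; only the two patterns come from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, the patterns are the two above.
public static class QuoteRewriter
{
    // Matches a double quote when the next angle bracket after it is '>',
    // i.e. the quote is (most likely) inside a tag.
    private static readonly Regex QuoteInsideTag = new Regex("\"(?=[^<]*>)");

    // Matches a double quote when the next angle bracket after it is '<',
    // i.e. the quote is (most likely) in text between tags.
    private static readonly Regex QuoteOutsideTag = new Regex("\"(?=[^>]*<)");

    public static string Rewrite(string html)
    {
        // Step 1: attribute quotes become single quotes.
        string withSingleQuotes = QuoteInsideTag.Replace(html, "'");
        // Step 2: remaining quotes that precede a tag become &quot;.
        return QuoteOutsideTag.Replace(withSingleQuotes, "&quot;");
    }
}
```

Called on the sample HTML from the question, `QuoteRewriter.Rewrite(html)` should give the output the asker wants. One caveat: the second pattern only matches a quote that has a `<` somewhere after it, so quotes in trailing text with no following tag are left alone. A plain `Replace("\"", "&quot;")` as the second step would cover those too.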

Further, as ridgerunner's comment points out:

Not a trivial task to do reliably. For example, you'll need to handle tags such as: <p title="Can't put this in single quotes!">..</p>. (Notice the single quote inside the double quoted attribute value.)

That's a very valid point. If you don't NEED single quotes here, I frankly wouldn't use them.

There are many instances where you don't want to use regex to parse HTML, but this is a very simple case and I see nothing wrong with using regex here. This is no different than "looking for a comma outside of parentheses", which would see a plethora of answers.
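If single quotes really are required, one possible way around ridgerunner's case is to encode any apostrophe inside the attribute value as `&#39;` while swapping the delimiters. The sketch below assumes the editor control accepts `&#39;` in attribute values and that no attribute value contains a `>` character. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class AttributeQuoteSwapper
{
    // Naive tag matcher: assumes no '>' appears inside an attribute value.
    private static readonly Regex Tag = new Regex("<[^>]*>");

    // A double-quoted value; group 1 is the text between the quotes.
    private static readonly Regex DoubleQuotedValue = new Regex("\"([^\"]*)\"");

    public static string SwapAttributeQuotes(string html)
    {
        // For every tag, rewrite each "value" as 'value', encoding any
        // apostrophe in the value so it cannot end the attribute early.
        return Tag.Replace(html, tag =>
            DoubleQuotedValue.Replace(tag.Value, value =>
                "'" + value.Groups[1].Value.Replace("'", "&#39;") + "'"));
    }
}
```

With this, `<p title="Can't put this in single quotes!">..</p>` becomes `<p title='Can&#39;t put this in single quotes!'>..</p>`, and quotes in the text between tags are not touched.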

But yes, indeed, more complex html pattern matching in regex is a very difficult/nigh-impossible task that is a leading cause in manual-hair-extraction baldness ages 18-$max(myage,50).

Regular Jo
0

You can

  1. replace quotes inside tags
  2. replace remaining quotes everywhere

Example

Regex rx = new Regex("<.*?>");
string result = rx.Replace(text, 
                       new MatchEvaluator(ReplaceLink)).Replace("\"", "&quot;");

...
static string ReplaceLink(Match m)
{
    return m.ToString().Replace("\"", "'");
}

Demo: https://dotnetfiddle.net/5qkXaE

user2316116
0

Although this is probably no longer relevant, here is another option for the question asked (for example, implemented in PHP > 5.2):

Your HTML code example.

    $cHTML = '<p>This is a "wonderful long text". "Another wonderful ong text"</p>'.
             ' At least it should be. Here we have a '.
             '<a href="http://wwww.site-to-nowhere.com" target="_blank">link</a>';

    // Let's transform it as you wanted.
    // First swap the quotes around attribute values to single quotes,
    // then encode every remaining double quote as &quot;.
    $cHTML = str_replace( '"','&quot;', 
                          preg_replace_callback('/[^\s][=].*?"(.*?)"/ui',
                                   function ($matches) {
                                     return str_replace( '"'.$matches[1].'"',
                                                         "'".$matches[1]."'", 
                                                             $matches[0]);
                                   }, $cHTML) 
                        );

    // Let's show the result.
    var_dump( $cHTML );

You will receive your "weird" HTML code:

<p>This is a &quot;wonderful long text&quot;. &quot;Another wonderful ong text&quot;</p> At least it should be. Here we have a <a href='http://wwww.site-to-nowhere.com' target='_blank'>link</a>