I am receiving some result data as follows:

\u003cdiv\u003esome message comes here\u003c/div\u003e

And I need to parse it back, which is easily done with:

string result = HttpUtility.HtmlDecode(Regex.Unescape(data));

However, if there is a regex within the string, for example:

\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e

It will throw an error:

parsing "\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e" - Unrecognized escape sequence \w.

I don't need the regex inside the text to be processed; in fact, everything in it can be taken literally.

How can I convert:

\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e

Back to normal?

<div>some message \w+ comes here</div>

NOTE: I've looked around but found no answer that addresses this. I did find answers telling people to use @, but the data is not typed in by me; it is received from elsewhere, so AFAIK I can't just do string data = @receivedData;.

Guapo

2 Answers

There are two separate escape types intermixed here. You can try this:

Regex.Unescape(Regex.Replace(data, "\\\\([^u])", "\\\\$1"))

This will preserve the \u... values but escape the other backslashes.

If you do this operation often, you'll want to create a Regex instance once and reuse it on every call:

Regex regex = new Regex("\\\\([^u])"); // Reuse this instance

// When parsing the data:
Regex.Unescape(regex.Replace(data, "\\\\$1"));
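For reference, here is a minimal, self-contained sketch of the approach above applied to the sample string from the question. The class and method names (MixedEscapeDecoder, Decode, MixedEscapeDemo) are made up for illustration, and System.Net.WebUtility.HtmlDecode is swapped in for HttpUtility.HtmlDecode only so the sketch does not need a System.Web reference.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper name, used only for this illustration.
public static class MixedEscapeDecoder
{
    // Matches a backslash followed by any single character other than 'u'.
    private static readonly Regex NonUnicodeEscape = new Regex(@"\\([^u])");

    public static string Decode(string data)
    {
        // Double the backslash so that \w becomes \\w; \uXXXX is left untouched.
        string protectedData = NonUnicodeEscape.Replace(data, @"\\$1");

        // Regex.Unescape now turns \\w back into \w and \u003c into <.
        // WebUtility.HtmlDecode stands in for HttpUtility.HtmlDecode (assumption).
        return WebUtility.HtmlDecode(Regex.Unescape(protectedData));
    }
}

public static class MixedEscapeDemo
{
    public static void Main()
    {
        string data = @"\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e";
        Console.WriteLine(MixedEscapeDecoder.Decode(data));
        // prints: <div>some message \w+ comes here</div>
    }
}
```

Note that this only protects backslashes that are not followed by u. A literal backslash followed by u that is not a valid four-digit escape (for example \user) would most likely still make Regex.Unescape throw.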
Nick Gotch

The problem is that you're applying Regex.Unescape to something that wasn't entirely produced by Regex.Escape. You would hit the same problem with just about any encoding where part of a message is encoded and the rest is not. You can try to anticipate all the variations, but there will be cases where you cannot tell whether a sequence was an intentional escape or literal text that just happens to look like one. The only surefire way is to ensure the entire message is consistently encoded. That means completely decoding the message any time you manipulate the string, and then re-encoding the entire string.

Here is a demonstration I ran in LINQPad; the output for each corresponding .Dump() follows. It does the full encoding and then the complete decoding. Notice that halfway through, the \w gets escaped by Regex.Escape. So the crux of your issue is that the "some message \w+ here" part of the message was never regex-escaped, and applying Regex.Unescape to it fails because you can't unescape something that isn't escaped.

string ori = @"<div>some message \w+ here</div>"; // verbatim string: no C# escaping, so \w is a literal backslash followed by w

ori.Dump(); // Verify that real string is "<div>some message \w+ here</div>"

string regexEscaped = System.Text.RegularExpressions.Regex.Escape(ori);

regexEscaped.Dump();    

// Regex.Escape does not replace "<" or ">" with Unicode escape sequences, since they need no escaping in a regex. They can be forced into the escaped string.
// This step is unnecessary and can stay commented out.
//regexEscaped = regexEscaped.Replace(">", @"\u003e").Replace("<",@"\u003c");    
//regexEscaped.Dump();

string htmlEscaped_regexEscaped = System.Web.HttpUtility.HtmlEncode(regexEscaped).Dump();

System.Text.RegularExpressions.Regex.Unescape( System.Web.HttpUtility.HtmlDecode(htmlEscaped_regexEscaped)).Dump();
// Since we encoded the entire string we were able to successfully decode it.

Output:

 Original: <div>some message \w+ here</div>
Rgx Escpd: <div>some\ message\ \\w\+\ here</div>
HTML Encd: &lt;div&gt;some\ message\ \\w\+\ here&lt;/div&gt;
HTML Uncd & Rgx Unesc: <div>some message \w+ here</div>

Are you using this for matching?

If your intent is to use the string "\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e" as a regex pattern for matching, there is no need to do anything to it. A matcher implementing the full regex feature set should understand "\u003c", so there is no need to convert it to "<":

http://www.regular-expressions.info/unicode.html

The client isn't really doing a Regex Escape?

It seems more likely that the client isn't really doing a regex escape, in which case Regex.Unescape is bound to fail on input like this. Is it doing some sort of HTML encoding, but replacing the characters with Unicode escape sequences instead of HTML character entities? Maybe. Without documented behavior for the client, that is an educated guess, and you have to hope it doesn't produce other inconsistent encodings down the line.

In this case, I would target just the Unicode escape sequences. Here is a question that covers replacing Unicode escape sequences without using Regex.Unescape:

How do convert unicode escape sequences to unicode characters in a .NET string
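
As a rough sketch of what targeting only the Unicode escape sequences could look like, the snippet below replaces each \uXXXX with its character and leaves every other backslash alone. It is not taken from the linked question; the names UnicodeEscapeDecoder, Decode and UnicodeEscapeDemo are hypothetical, and it assumes the escapes always use exactly four hex digits.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper name, used only for this illustration.
public static class UnicodeEscapeDecoder
{
    // Matches \u followed by exactly four hex digits (assumed input format).
    private static readonly Regex UnicodeEscape = new Regex(@"\\u([0-9A-Fa-f]{4})");

    public static string Decode(string data)
    {
        // Convert each matched escape to its UTF-16 code unit; everything
        // else, including sequences such as \w, is copied through as-is.
        return UnicodeEscape.Replace(data, m =>
            ((char)int.Parse(m.Groups[1].Value,
                             NumberStyles.HexNumber,
                             CultureInfo.InvariantCulture)).ToString());
    }
}

public static class UnicodeEscapeDemo
{
    public static void Main()
    {
        string data = @"\u003cdiv\u003esome message \w+ comes here\u003c/div\u003e";
        Console.WriteLine(UnicodeEscapeDecoder.Decode(data));
        // prints: <div>some message \w+ comes here</div>
    }
}
```

The result can then be passed to HttpUtility.HtmlDecode as in the question. Because nothing is treated as a regex escape, this version never throws on input such as \w.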

AaronLS
  • No, that's not my intention. As I mentioned, I receive the data like that from some API, and what I am trying to achieve is to put the message back into its original state and then remove the HTML from it. I appreciate your explanation of why it fails, but beyond that it was rather overcomplicated. Also, I can't do a specific replace for every tag it converts, as I don't know what will be converted. – Guapo Aug 06 '14 at 18:25
  • In short: you can't use unescape/decode on something that wasn't for certain paired with an escape/encode at some point. It is tempting to leverage Unescape as a tool in this situation, but it is bound to fail if the source string wasn't escaped/encoded the same way. I didn't realize you had no control over the client, and since you were using Regex.Unescape I assumed you intended to use the string as a regex, so I took it to be an issue with how the string was formed. – AaronLS Aug 06 '14 at 20:35
  • @Guapo To target just the Unicode chars, see: http://stackoverflow.com/questions/183907/how-do-convert-unicode-escape-sequences-to-unicode-characters-in-a-net-string – AaronLS Aug 06 '14 at 20:37