Unable to identify the encoding of the text C#

Question

I have an XML file that contains information about a customer, such as FirstName, LastName, Address, etc. The text in any field might be bold or italic, but the problem is that it is converted to codes like "55349;56400;ℎ55349;56398;55349;56415;55349;56409;55349;56412;55349;". (The file is the output of a tool that can't be edited.) Note: I had to remove &# from each code to make it readable.

My question is: how do I identify the encoding of the codes above and convert them to normal text so that they can be processed successfully?

You lost me. Bold or italics? In XML? If you open the file using an editor, how do you see it? — Sotiris Panopoulos, Jun 03 '20 at 17:56
Hi. It looks like Unicode, since you say you removed &# from the numbers. I assume it was in front of each number? Please look at https://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c which explains how to decode Unicode HTML in C#. — Nicolai Schlenzig, Jun 03 '20 at 18:05
How do you read the XML? If you use XmlReader, XDocument, or XmlDocument, it would automatically decode &#…; entities as well as handle all the other XML edge cases. Don't parse XML by hand. — ckuri, Jun 03 '20 at 18:08
Hi, thanks for responding! The text can come in bold or italics, or both: some characters might be bold, some might be italic. I tried HttpUtility.Decode and WebUtility.Decode, but the data stays the same and is not converted to text. — learner, Jun 04 '20 at 18:44
I can try using XmlReader, XDocument as you suggested. Thanks! — learner, Jun 04 '20 at 18:47

score 2 · Answer 1 · answered Jun 03 '20 at 17:57

Those would be HTML-escaped characters (numeric character references); try putting &#55349;&#56400;ℎ&#55349;&#56398;&#55349;&#56415;&#55349;&#56409;&#55349;&#56412; into this form, and note that the unescaped string is 𝑐ℎ𝑎𝑟𝑙𝑜. For methods to perform the decoding in C#, see this SO post.
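
As a rough illustration of the decoding (a sketch under stated assumptions, not necessarily what the linked post does): each pair such as &#55349;&#56400; looks like the two UTF-16 surrogate halves of a single character outside the Basic Multilingual Plane, here U+1D450 (mathematical italic small c). The hypothetical `CharRefDecoder` helper below replaces every decimal reference with the matching UTF-16 code unit so that the pairs recombine, then applies NFKC normalization, which folds the mathematical italic letters to plain letters. The class and method names are made up for this example, and it assumes the data only contains decimal references of the form &#NNN;.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class CharRefDecoder
{
    // Matches decimal numeric character references such as &#55349;
    private static readonly Regex DecimalRef = new Regex("&#([0-9]+);");

    // Replaces each decimal reference with the character(s) it denotes.
    public static string Decode(string input)
    {
        return DecimalRef.Replace(input, m =>
        {
            // Leave the reference untouched if it is not a valid code point.
            if (!int.TryParse(m.Groups[1].Value, out int value) || value > 0x10FFFF)
                return m.Value;

            // Values up to 0xFFFF become a single UTF-16 code unit, so a high
            // and low surrogate written as two references recombine into one
            // character. Larger values are converted from the code point.
            return value <= 0xFFFF
                ? ((char)value).ToString()
                : char.ConvertFromUtf32(value);
        });
    }

    // Decodes the references, then folds styled letters (for example
    // mathematical italic) to their plain compatibility equivalents.
    // Note: Normalize may throw ArgumentException if the input leaves an
    // unpaired surrogate behind, e.g. a trailing lone &#55349;.
    public static string ToPlainText(string input)
    {
        return Decode(input).Normalize(NormalizationForm.FormKC);
    }
}
```

With the string from this answer, `CharRefDecoder.ToPlainText("&#55349;&#56400;\u210E&#55349;&#56398;&#55349;&#56415;&#55349;&#56409;&#55349;&#56412;")` should return "charlo". The sample in the question ends with a lone 55349; that half pair would need to be completed or dropped before normalizing.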
