
I have an XML file that contains information about a customer, such as FirstName, LastName, Address, etc. The text in any of the fields might be bold or italic, but the problem is that it is converted to codes like "55349;56400;ℎ55349;56398;55349;56415;55349;56409;55349;56412;55349;". (The file is the output of a tool that can't be edited.) Note: I had to remove &# from each code to make it readable.

My question is: how do I determine the encoding of the codes above and convert them to normal text so that the data can be processed successfully?

  • You lost me. Bold or italics? In XML? If you open the file in an editor, what does it look like? – Sotiris Panopoulos Jun 03 '20 at 17:56
  • Hi. It looks like Unicode, since you say you removed &# from the numbers. I assume it was in front of each number? Please look at https://stackoverflow.com/questions/122641/how-can-i-decode-html-characters-in-c which explains how to decode Unicode HTML in C#. – Nicolai Schlenzig Jun 03 '20 at 18:05
  • How do you read the XML? If you use XmlReader, XDocument, or XmlDocument, it will automatically decode &#…; entities and handle all the other XML edge cases. Don’t parse XML by hand. – ckuri Jun 03 '20 at 18:08
  • Hi, thanks for responding! The text can come in either bold or italics. Some characters might be bold, some might be italic. I tried HttpUtility.HtmlDecode and WebUtility.HtmlDecode, but the data remains the same; it is not converted to text. – learner Jun 04 '20 at 18:44
  • I can try using XmlReader, XDocument as you suggested. Thanks! – learner Jun 04 '20 at 18:47

1 Answer


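Those would be HTML-escaped characters (numeric character references); try putting the escaped string, with the &# prefixes restored, into this form, and note that the unescaped string consists of styled letters such as ℎ. For methods to perform the decoding in C#, see this SO post.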
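
As a rough illustration of the decoding step, here is a minimal sketch rather than a definitive implementation. It relies on the observation that 55349 (0xD835) followed by a value such as 56400 (0xDC50) is a UTF-16 surrogate pair, so each reference can be turned into one UTF-16 code unit and adjacent halves then combine into a single character. The class name `NumericEntityDecoder` and its method names are hypothetical. The final NFKC normalization, which folds mathematical italic or bold letters into plain letters, is an extra step assumed here to match the "normal text" requirement; it is not part of the linked posts. The sample input is the first few codes from the question with the `&#` prefixes restored.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class NumericEntityDecoder
{
    // Matches decimal numeric character references such as &#55349;
    private static readonly Regex DecimalReference = new Regex(@"&#([0-9]+);");

    // Replaces each decimal reference with the character(s) it stands for.
    public static string Decode(string input)
    {
        return DecimalReference.Replace(input, match =>
        {
            // Leave the reference untouched if the number is out of range.
            if (!int.TryParse(match.Groups[1].Value, out int code) || code > 0x10FFFF)
            {
                return match.Value;
            }

            // Values up to 0xFFFF are single UTF-16 code units. This includes
            // surrogate halves, which form one character once both adjacent
            // halves have been decoded.
            return code <= 0xFFFF
                ? ((char)code).ToString()
                : char.ConvertFromUtf32(code);
        });
    }

    // Assumption: "normal text" means plain letters, so compatibility
    // normalization (NFKC) is applied after decoding. Normalize throws
    // ArgumentException if the text contains an unpaired surrogate.
    public static string ToPlainText(string input)
    {
        return Decode(input).Normalize(NormalizationForm.FormKC);
    }
}

public static class Program
{
    public static void Main()
    {
        // First three characters of the question's sample, &# restored.
        string raw = "&#55349;&#56400;&#8462;&#55349;&#56398;";
        Console.WriteLine(NumericEntityDecoder.ToPlainText(raw));
    }
}
```

Note that the sample in the question ends with a dangling `55349;`, that is, a high surrogate without its partner. Input truncated like that would need to be trimmed or handled before the normalization call.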

fuglede