
When saving an existing XML file to a new location, entities in the content are lost and replaced with question marks

See the screenshots below: the entity ‐ (&#x2010; as hex) is present while reading, but it is replaced with a question mark after saving to another location.

[Image: While reading as inner XML]

[Image: While reading as inner text]

[Image: After saving XML file]

EDIT 1: Below is my code.

string path = @"C:\work\myxml.XML";
string pathnew = @"C:\work\myxml_new.XML";
//GetFileEncoding(path);
XmlDocument document = new XmlDocument();
XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0","US-ASCII",null);
//document.CreateXmlDeclaration("1.0", null, null);
document.Load(path);
string x = document.InnerText;
document.Save(pathnew);

EDIT 2: My source file looks like the image below. I need to retain the entities exactly as they are.

[Image: source XML file]

Karthick Gunasekaran
  • 2
    It's almost certainly an encoding problem, but no one can help you without providing some *code* instead of pictures. How are you writing the XML? – Charles Mager May 10 '16 at 13:39
  • @CharlesMager, Thanks for your attempt. See the edited question – Karthick Gunasekaran May 10 '16 at 13:44
  • Is your source file *actually* US-ASCII? Or does the declaration just say it is? I don't think your character [exists in ASCII](https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart), which is why it's being replaced. `XmlDocument` is inferring the encoding to use on save from the declaration. – Charles Mager May 10 '16 at 13:45
  • Why not use [File.Copy](https://msdn.microsoft.com/en-us/library/system.io.file.copy%28v=vs.110%29.aspx)? – Alexander Petrov May 10 '16 at 13:50
  • @CharlesMager, yes my source file is US-ASCII encoding. pls. see here my unicode char http://unicodelookup.com/#‐/1 – Karthick Gunasekaran May 10 '16 at 14:05
  • @AlexanderPetrov, i am not copying the exact file but i need to update specific node in existing xml file – Karthick Gunasekaran May 10 '16 at 14:10
  • 1
    Ascii encoding removes non-printable characters. So you can have ascii (one byte characters) that are not unicode (two byte characters) that will result in question marks. – jdweng May 10 '16 at 14:42
  • 1
    @Karthick I'd be pretty sure your source file *isn't* ASCII. That character doesn't exist in ASCII. Open the file in Notepad++ or something and check the encoding. – Charles Mager May 10 '16 at 14:47
  • @jdweng, Thanks for your information. How can I retain the unicode characters as they are in the source? In the source XML the unicode hex is &#x2010;, which is nothing but a hyphen. I want to retain this hex after updating the specific node and saving. – Karthick Gunasekaran May 10 '16 at 14:49
  • @CharlesMager, yes you are right its not in ASCII category. but i need to retain the entity characters as it is in source. pls see the edited part 2 – Karthick Gunasekaran May 10 '16 at 14:52
  • Try this : XmlDeclaration xmlDeclaration = document.CreateXmlDeclaration("1.0", "unicode", null); – jdweng May 10 '16 at 15:06
  • If you use an XmlReader you can turn the character check off: XmlReaderSettings settings = new XmlReaderSettings() { CheckCharacters = false }; – jdweng May 10 '16 at 15:11
  • 1
    @Karthick it's very easy to do, but you need to know what encoding you need to write your file in. So you need to check what it's actually encoded in. – Charles Mager May 10 '16 at 15:19
  • @jdweng I tried with xmlDeclaration but it is no use; it still returns a question mark – Karthick Gunasekaran May 10 '16 at 15:19
  • @CharlesMager i need to write UTF-8 encoding characters with US-ASCII xml – Karthick Gunasekaran May 10 '16 at 15:20
  • i tried to find the encoding method which is in source file its shows me as default (iso-8859-1) – Karthick Gunasekaran May 10 '16 at 15:21
  • 8859-1 can be unicode. See webpage : https://en.wikipedia.org/wiki/ISO/IEC_8859-1 – jdweng May 10 '16 at 15:32
  • 1
    Ah, ok - your latest edit makes a lot more sense. The file *is* ASCII encoded, but the character is an entity reference. – Charles Mager May 10 '16 at 15:32
  • @CharlesMager, yes your understanding is absolutely right – Karthick Gunasekaran May 10 '16 at 15:34
  • Possible duplicate of [How do I XmlDocument.Save() to encoding="us-ascii" with numeric character entities instead of question marks?](http://stackoverflow.com/questions/22394441/how-do-i-xmldocument-save-to-encoding-us-ascii-with-numeric-character-entiti) – Charles Mager May 10 '16 at 15:49
  • As a general aside, for most XML writers, you can't "retain the entity characters as it is in source". There is no reason to use a numeric character entity if the file character set supports it. – Tom Blodget May 10 '16 at 16:22

1 Answer


The issue here seems to be the handling of encoding of entity references by the specific XmlWriter implementation internal to XmlDocument.

The issue disappears if you create an XmlWriter yourself: the unsupported character will be correctly encoded as an entity reference. This XmlWriter is a different (and newer) implementation that sets an EncoderFallback which writes any character the target encoding can't represent as a character entity reference. Per the remarks in the docs, the default fallback mechanism is to encode a question mark.

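// Requires: using System.Text; using System.Xml;
var settings = new XmlWriterSettings
{
    Indent = true,
    // Target encoding of the output file; must match what the file should contain
    Encoding = Encoding.GetEncoding("US-ASCII")
};

// Save through an explicitly created XmlWriter instead of document.Save(pathnew)
using (var writer = XmlWriter.Create(pathnew, settings))
{
    document.Save(writer);
}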
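To try the approach without touching real files, the following is a minimal, self-contained sketch. The class name `AsciiEntityDemo`, the helper `SaveAsAscii`, and the sample XML are made up for illustration and are not part of the question. It saves through an explicitly created XmlWriter into a MemoryStream, and also shows the question-mark replacement that a plain ASCII encoder applies by default.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class AsciiEntityDemo
{
    // Hypothetical helper: saves the document through an explicitly created
    // XmlWriter that targets US-ASCII and returns the written bytes as text.
    public static string SaveAsAscii(XmlDocument document)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            Encoding = Encoding.GetEncoding("US-ASCII")
        };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }
            // The writer is flushed and disposed here, so the stream is complete.
            return Encoding.ASCII.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // Sample input (an assumption): one element containing U+2010 as an entity.
        var document = new XmlDocument();
        document.LoadXml(
            "<?xml version=\"1.0\" encoding=\"US-ASCII\"?><root>a&#x2010;b</root>");

        // The hyphen should come back out as a numeric character reference.
        Console.WriteLine(SaveAsAscii(document));

        // For comparison: the default ASCII encoder replaces the character.
        byte[] raw = Encoding.ASCII.GetBytes("a\u2010b");
        Console.WriteLine(Encoding.ASCII.GetString(raw)); // prints "a?b"
    }
}
```

Writing to a MemoryStream keeps the sketch free of file paths; with a real file, `XmlWriter.Create(pathnew, settings)` as in the answer plays the same role.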

As an aside, I'd recommend using the LINQ to XML XDocument API; it's much nicer to work with than the old, creaky XmlDocument API. And its version of Save doesn't have this problem, either!
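A rough sketch of what that could look like, assuming the goal is to update one node and save the file again. The class `XDocumentAsciiDemo`, the method `UpdateAndSave`, the element names `title` and `body`, and the temp-file handling are all hypothetical and chosen only for the demonstration.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

public static class XDocumentAsciiDemo
{
    // Hypothetical helper: parses the XML, updates the <title> element,
    // saves with XDocument.Save, and returns the saved file's text.
    public static string UpdateAndSave(string xml, string newTitle)
    {
        string path = Path.GetTempFileName();
        try
        {
            XDocument document = XDocument.Parse(xml);

            // Update a specific node; "title" is an assumed element name.
            document.Root.Element("title").Value = newTitle;

            // Save takes the encoding from the XML declaration of the document.
            document.Save(path);
            return File.ReadAllText(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        // Sample input (an assumption), declared as US-ASCII like the question's file.
        string xml =
            "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>" +
            "<root><title>old</title><body>a&#x2010;b</body></root>";

        Console.WriteLine(UpdateAndSave(xml, "updated"));
    }
}
```

With a real file, `XDocument.Load(path)` followed by `document.Save(pathnew)` would take the place of `Parse` and the temp file.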

Charles Mager