I'm trying to copy a node value from one XML file to another. The program below reads the value of the first pub-title node from an XML file in a folder called abc and writes it to the first publisher-name node of another XML file in a folder named xyz.

NOTE: The escape_string method is implemented to not modify the UTF-8 entity values and keep them as they are.

var job_folders = Directory.EnumerateDirectories(textBox1.Text, "*", SearchOption.TopDirectoryOnly);
foreach (string job_folder in job_folders)
{
    var target_xml_file = Directory.GetFiles(job_folder, "*.xml", SearchOption.AllDirectories).Where(x => Path.GetFileName(Path.GetDirectoryName(x)).ToLower() == "abc").First();
    var target_meta_file = Directory.GetFiles(job_folder, "*.xml", SearchOption.AllDirectories).Where(x => Path.GetFileName(Path.GetDirectoryName(x)).ToLower() == "xyz").First();

    string path = Path.GetFullPath(target_meta_file);
    string file_content = escape_string(File.ReadAllText(path), 0);
    XDocument doc = XDocument.Parse(file_content, LoadOptions.PreserveWhitespace);
    var lbl=doc.Descendants("pub-title").First().Value;
    XDocument doc2 = XDocument.Parse(escape_string(File.ReadAllText(target_xml_file), 0), LoadOptions.PreserveWhitespace);
    doc2.DocumentType.InternalSubset = null;
    doc2.Descendants("publisher-name").First().Value=lbl;
    doc2.Save(target_xml_file);
    File.WriteAllText(target_xml_file, escape_string(doc2.ToString(), 1));
}

MessageBox.Show("Complete");

private static string escape_string(string input_string, int option)
{
    switch (option)
    {
        case 0:
            return input_string.Replace("&", "&amp;").ToString();
        case 1:
            return input_string.Replace("&amp;", "&").ToString();
        default:
            return null;
    }
}

Everything works, but the <?xml version="1.0" encoding="utf-8"?> declaration is removed from the file in target_xml_file.

How do I fix this?

File before modification

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd"[]>
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>SPIE</publisher-name>
</publisher>
</journal-meta>
....
....

File after

<?xml-stylesheet type="text/xsl" href="jats-html.xsl"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">
<article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id" />
<journal-title-group>
<journal-title>Eleventh &#x0026; Tenth International Conference on Correlation Optics</journal-title>
</journal-title-group>
<issn pub-type="epub">0277-786X</issn>
<publisher>
<publisher-name>a</publisher-name>
</publisher>
</journal-meta>
Don_B
  • Why do you save the file twice, the second time as text instead of as an XML file? What happens if you remove `File.WriteAllText(target_xml_file, escape_string(doc2.ToString(), 1));` after `doc2.Save(target_xml_file);`? – oerkelens Nov 30 '17 at 13:27
  • `XDocument.ToString()` does not include the XML declaration. It is available explicitly in the `.Declaration` property. – Jeroen Mostert Nov 30 '17 at 13:29
  • @JeroenMostert What do I do then? – Don_B Nov 30 '17 at 13:31
  • @oerkelens If I remove `File.WriteAllText(target_xml_file, escape_string(doc2.ToString(), 1));` then the UTF-8 codes like `&#x2014;` are kept like `&amp;#x2014;`. I need the escape_string method to revert it back to `&#x2014;` – Don_B Nov 30 '17 at 13:35
  • But why do you even try `doc2.Save(target_xml_file);` then? Have you looked what happens after that or have you only tried to overwrite the file again after you have saved it the first time as an XML-document? – oerkelens Nov 30 '17 at 13:40
  • Please provide a [mcve], as a console app, ideally with idiomatic C# names (e.g. `targetXmlFile` rather than `target_xml_file`). It's not clear why you're manually escaping anything at all - that's almost always a bad idea, to be honest. – Jon Skeet Nov 30 '17 at 13:41
  • @Don_B: you could [use the property](https://stackoverflow.com/questions/1228976/xdocument-tostring-drops-xml-encoding-tag). I haven't looked deeply at your use case because it's confusing and messy. – Jeroen Mostert Nov 30 '17 at 13:41
  • @oerkelens the problem remains even if I omit `doc2.Save(target_xml_file);` – Don_B Nov 30 '17 at 13:48
  • And if you _only_ do `doc2.Save(target_xml_file);`? That line is supposed to save the file as you want it, but the next line destroys the file. Either `doc2.Save(target_xml_file);` does exactly what you want and `File.WriteAllText` is useless and destructive, or `doc2.Save(target_xml_file);` does _not_ do what you want and you should _replace_ it. But if you overwrite the file like you do now, you cannot know whether it did what it should do. – oerkelens Nov 30 '17 at 14:05
  • @oerkelens before any operation the `target_xml_file` file contains strings like `&#x00E1;`. While parsing the file I use `escape_string(File.ReadAllText(target_xml_file), 0)` to convert those strings to `&amp;#x00E1;` so that they do not get converted to their character counterpart like `á`. Then I need to change `&amp;#x00E1;` back to `&#x00E1;`, which is why I use `File.WriteAllText(target_xml_file, escape_string(doc2.ToString(), 1));` Do you get it now? – Don_B Nov 30 '17 at 14:15
  • I get that you have an XML document, doc2, and you convert it to a string instead of saving it as an XML document. What I do not get is why you have code to save your XML document as an XML document, only to overwrite it after you save it. What does your file look like after `doc2.Save(target_xml_file);`? – oerkelens Nov 30 '17 at 14:26
  • @oerkelens check the updated question. If I do only `doc2.Save` then `Eleventh &#x0026; Tenth International Conference on Correlation Optics` becomes `Eleventh &amp;#x0026; Tenth International Conference on Correlation Optics` – Don_B Nov 30 '17 at 14:41
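
The point raised in the comments, that `XDocument.ToString()` leaves out the XML declaration while the `.Declaration` property still holds it, can be illustrated with a minimal sketch. The class and method names below (`DeclarationSketch`, `ToStringWithDeclaration`) are hypothetical helpers introduced only for illustration; they are not part of the original program.

```csharp
using System;
using System.Xml.Linq;

public static class DeclarationSketch
{
    // Hypothetical helper: serializes the document with ToString() and
    // puts the declaration kept in doc.Declaration back in front of it.
    public static string ToStringWithDeclaration(XDocument doc)
    {
        // ToString() serializes the document without the XML declaration.
        string body = doc.ToString();

        // Declaration is null when the parsed text had no declaration.
        if (doc.Declaration == null)
        {
            return body;
        }

        return doc.Declaration.ToString() + Environment.NewLine + body;
    }
}
```

Under that assumption, the last line of the loop in the question could pass `DeclarationSketch.ToStringWithDeclaration(doc2)` to `escape_string` instead of `doc2.ToString()`. The declaration text is whatever the parsed file contained, so the encoding it names is only correct if the file is then written in that encoding.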

2 Answers

Following the answer to "XDocument.ToString() drops XML Encoding Tag", you should not use the ToString method. Save the document through a writer instead, here an XmlTextWriter over a MemoryStream so that the output is UTF-8:

using (var stream = new MemoryStream())
{
    using (var writer = new XmlTextWriter(stream, Encoding.UTF8))
    {
        doc2.Save(writer);
    }
    string xml = escape_string(Encoding.UTF8.GetString(stream.ToArray()), 1);
    File.WriteAllBytes(target_xml_file, Encoding.UTF8.GetBytes(xml));
}
Andrii Litvinov
  • now `<?xml version="1.0" encoding="utf-8"?>` is being converted to `<?xml version="1.0" encoding="utf-16"?>` – Don_B Dec 04 '17 at 15:45
  • @Don_B that's because strings in C# are unicode or utf-16 encoded. If you want utf-8 encoding you can use `XmlTextWriter` and provide `Encoding.UTF8` as second parameter. But you must write to either file or a stream, because, again, strings are utf-16. – Andrii Litvinov Dec 04 '17 at 16:02
  • @Don_B just create a new stringwriter that inherits from `StringWriter` and override encoding (`public override Encoding Encoding => Encoding.UTF8;`) – FakeCaleb Dec 04 '17 at 16:08
  • @Don_B, are you sure you need to use `escape_string`? Does not seem quite right. – Andrii Litvinov Dec 04 '17 at 16:11
  • @AndriiLitvinov how else can I keep the codes like `&#x002A;, &#x0026; ..etc` unchanged in the file? – Don_B Dec 04 '17 at 16:12
  • @Don_B I have shown code, just create a class that inherits from `StringWriter` and use that instead? Use the code above that changes the encoding – FakeCaleb Dec 04 '17 at 16:14
  • @Don_B I have updated my answer. It is not that efficient with back-and-forth conversions, but will do the trick. – Andrii Litvinov Dec 04 '17 at 17:31
  • @AndriiLitvinov Thanks...But when I operate on large number of files its not that efficient. – Don_B Dec 05 '17 at 02:33
  • @Don_B the efficiency of this solution is more or less the same as in your original implementation. My concern is the Unicode character encoding you handle with `escape_string`. I think `XDocument` should be able to handle it correctly, which would avoid the need for `escape_string` completely. I am not sure why it does not handle those characters. Consider asking another question on that matter, maybe someone will help. Then you will be able to write the XML directly to a file with the correct encoding. – Andrii Litvinov Dec 05 '17 at 07:03
  • Maybe there is no solution to it though. I did a quick search and found a solution that you've probably used here https://stackoverflow.com/a/5410901/2138959. – Andrii Litvinov Dec 05 '17 at 07:59
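
FakeCaleb's suggestion in the comments above, a class that inherits from `StringWriter` and overrides `Encoding`, could look roughly like the sketch below. The names `Utf8StringWriter` and `Utf8SaveSketch.Serialize` are assumptions made for illustration. The override only changes the encoding name written into the declaration; the returned string is still an ordinary in-memory C# string.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: a StringWriter that reports UTF-8 as its encoding,
// so the XML declaration written by Save says encoding="utf-8".
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

public static class Utf8SaveSketch
{
    // Serializes the document to a string, declaration included.
    public static string Serialize(XDocument doc)
    {
        using (var writer = new Utf8StringWriter())
        {
            // Unlike ToString(), Save emits the XML declaration.
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

With this in place, the string from `Utf8SaveSketch.Serialize(doc2)` could be passed through `escape_string(..., 1)` and written with `File.WriteAllText`, which writes UTF-8 by default.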

Why not simply set an XDeclaration on the document after processing, something like

new XDeclaration("1.0", "utf-8", null)

Then save the file. It takes only two lines of code.

Tamal Banerjee
  • A little surprised to see this answer accepted, because UTF-8 encoding will be used only when saving the document, which prevents post-processing in `escape_string`. And when `doc.ToString` is used the declaration is stripped anyway. Moreover, the encoding will be UTF-16, not UTF-8. Following this way of thinking, it would be easier to simply prepend the resulting string with `"<?xml version=\"1.0\" encoding=\"utf-8\"?>"`. – Andrii Litvinov Dec 10 '17 at 06:59
  • @AndriiLitvinov what do you mean by prepending the resulting string with `"<?xml version=\"1.0\" encoding=\"utf-8\"?>"`? Wouldn't that also be deleted when calling the `doc.ToString` method? – Tamal Banerjee Dec 10 '17 at 11:22
  • I mean `"<?xml version=\"1.0\" encoding=\"utf-8\"?>" + escape_string(doc2.ToString(), 1)` from the OP's code sample. – Andrii Litvinov Dec 10 '17 at 11:25
  • That will work, but your suggested solution won't. It actually does not change anything, because the XML declaration is already there in the `XDocument` after parsing the original XML document. – Andrii Litvinov Dec 10 '17 at 13:58