I am using the following routine to format an XML file:

    // Requires: using System; using System.Text; using System.Xml; using System.Xml.Linq;
    public static string FormatXml(string xml, bool clean = true)
    {
        // Nothing to format: treat null, empty, and whitespace-only input alike.
        if (string.IsNullOrWhiteSpace(xml))
        {
            return "";
        }

        try
        {
            // CleanXml is a separate helper (not shown here).
            string modifiedXml = clean ? CleanXml(xml) : xml;

            // Throws XmlException if the input is not well-formed XML.
            var element = XElement.Parse(modifiedXml);

            var settings = new XmlWriterSettings
            {
                OmitXmlDeclaration = true,
                Indent = true,
                NewLineOnAttributes = true
            };

            var stringBuilder = new StringBuilder();
            using (var xmlWriter = XmlWriter.Create(stringBuilder, settings))
            {
                element.Save(xmlWriter);
            }

            return stringBuilder.ToString();
        }
        catch (Exception e)
        {
            //MessageBox.Show(e.Message);
            return xml; // Fall back to the unformatted input.
        }
    }

But this routine chokes when it tries to format an XML file that did not encode the ampersand in the name property as &amp;

    <process id="702fe4d7-f312-49b9-959e-5cc8a421d38a" name="108_CareAllies_18&23_DSA & HV_ServiceOpsReport_Weekly" xmlns="http://www.blueprism.co.uk/product/process">

I get this error:

"'\"' is an unexpected token. The expected token is ';'. Line X, position Y." The reported position points to the ampersand. I don't have much experience parsing XML, and I expect to spend a lot of time writing a routine that replaces these occurrences with their encoded equivalents before calling the routine above.

I am looking for an efficient way to format many large XML files. Is there an easy and fast way to format XML files that have special characters in them?
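For illustration, the kind of pre-processing step I have in mind might look like the sketch below. It is only a sketch under assumptions: `XmlAmpersandFixer`, `EscapeBareAmpersands`, and `ParseLenient` are hypothetical names, not part of my existing code or of any library. It treats every `&` that does not start a named, decimal, or hexadecimal reference as a literal ampersand and rewrites it as `&amp;`. It does not understand CDATA sections or comments (ampersands inside them would be escaped too), and a reference to an undeclared entity such as `&foo;` is left alone and would still fail to parse.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper: repairs unescaped ampersands before parsing.
public static class XmlAmpersandFixer
{
    // Matches '&' unless it begins something shaped like a reference:
    //   named (&amp;), decimal (&#38;), or hexadecimal (&#x26;).
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?![A-Za-z_][A-Za-z0-9_.\-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)",
        RegexOptions.Compiled);

    // Returns the text with every bare '&' replaced by "&amp;".
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }

    // Escapes bare ampersands, then parses; still throws XmlException
    // if the text is malformed in some other way.
    public static XElement ParseLenient(string text)
    {
        return XElement.Parse(EscapeBareAmpersands(text));
    }
}
```

With this in place, `FormatXml` could call `XmlAmpersandFixer.EscapeBareAmpersands(xml)` before `XElement.Parse`. A single compiled regex pass is linear in the input size, which seems acceptable for large files, but I have not measured it.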

Chad
  • It looks like you are trying to read invalid XML files... There are plenty of existing posts on how to do that... Since invalid XML is not really XML, you should either get the files fixed or look for lax parsers (you may try HtmlAgilityPack, as it is designed to read through garbage)… beware of "GIGO" (garbage in, garbage out)... – Alexei Levenkov Nov 25 '19 at 23:08
  • I wasn't sure that an ampersand inside a quoted field really needs to be encoded. If not, the file wouldn't be considered garbage. – Chad Nov 26 '19 at 00:07
  • Re-read https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents – Alexei Levenkov Nov 26 '19 at 00:11
  • You haven't stated which line of your code produces the error. Neither have you shown what the CleanXml() method does. Both of those are probably important in order for someone to answer this question properly. – robbpriestley Nov 26 '19 at 00:17
  • @robbpriestley `name="108_CareAllies_18&23_DSA & HV_ServiceOpsReport_Weekly"` is invalid XML and the error clearly points to the same - reading invalid XML. So while more details may be nice there is really no need to polish yet another "how to read text that pretends to be XML but isn't" question... OP can decide whether to keep it, to find a good duplicate or turn it into something different. – Alexei Levenkov Nov 26 '19 at 00:21
  • @AlexeiLevenkov it sounds like you have it completely under control – robbpriestley Nov 26 '19 at 00:28
  • **`&` may not appear in an XML attribute value unless part of an entity such as `&amp;`**. See duplicate links for further help. – kjhughes Nov 26 '19 at 01:09
  • "an XML file that did not encode the ampersand in the name property as &amp;" - there is no such thing. If it doesn't encode ampersand correctly, then it isn't an XML file. – Michael Kay Nov 26 '19 at 01:14
