I am looking for a way to beautify incomplete XML documents. Ideally it should handle even large documents (e.g. 10 MB or maybe 100 MB).

Incomplete means that the documents are truncated at a random position. Until this position the XML has a valid syntax. Beautify means to add line breaks and leading spaces between the tags.

In my case this is needed to analyse aborted streams. Without line breaks and indentation the XML is really hard for a human to read. I know there are some editors which can beautify incomplete documents, but I want to integrate the beautifier into my own analysis tool.

Unfortunately I didn't find a discussion or solution for this case.

The NuGet package GuiLabs.Language.Xml by Kirill Osenkov (repository XmlParser) seems to be a useful candidate for implementing my own beautifier, because it is designed to be error tolerant. Unfortunately there is too little documentation to understand how to use this parser.

Example xml:

<?xml encoding="UTF-8"?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p="pp"/><nn:A>cc</nn:A><D><E>eee</

Expected result as string:

<?xml encoding="UTF-8"?>
<X>
    <B>
        <C>aa</C>
    <B/>
    <A.B>
        <X>bb</X>
    </A.B>
    <A p="pp"/>
    <nn:A>cc</nn:A>
    <D>
        <E>eee</
ProgrammingLlama
Beauty
  • You can't parse invalid XML, it's invalid. You'd need to treat the file as a text file and write your own custom "beautifier". – Liam Jul 28 '21 at 10:26
  • Browsers display XML by applying an XSLT transformation to it, to convert it to HTML. You can use an XmlReader to read XML token by token, and thus incomplete XML, but *not* incomplete tokens. If you only want to produce text, you can use an XmlReader to read tokens and emit them along with newlines or tabs. You may be able to apply [an XSLT transformation to an XmlReader, writing directly to an XmlWriter](https://docs.microsoft.com/en-us/dotnet/standard/linq/use-xslt-transform-xml-tree) but that will definitely throw on the last incomplete token. The output generated up to that point may be enough – Panagiotis Kanavos Jul 28 '21 at 10:34
  • Well, I doubt that the `<?xml encoding="UTF-8"?>` meets the criteria for an XML declaration, so even with an XML parser where you can handle the abort by catching a parse error, I doubt the parser gets beyond that crippled XML declaration. – Martin Honnen Jul 28 '21 at 12:09
  • Also, you are using prefixes without declaring namespaces, which is also hard to get past, at least with the default settings of XML parsers, although it might work if you can set yours to be non-namespace-aware. – Martin Honnen Jul 28 '21 at 12:12
  • Thanks for your comments and answers! @MartinHonnen: You are right about the XML declaration and namespace. I didn't add more declarations to the example code, because it should be tiny. It just shows what I mean. – Beauty Jul 28 '21 at 13:16
  • You can stream from an `XmlReader` to an `XmlWriter` and format the output, see [this answer](https://stackoverflow.com/a/68073898/3744182) to [Format XML string to print friendly XML string](https://stackoverflow.com/q/1123718/3744182). If you catch the exception thrown and flush the `XmlWriter`, the output should contain some formatted XML subset. However if your XML has been randomly truncated, the final, malformed node might not get written. – dbc Jul 31 '21 at 17:25

2 Answers

The error-tolerant XML parser of AngleSharp.Xml can parse your sample, although it adds the missing closing tags. You can then get an XML string from the resulting document and, with the help of the legacy XmlTextReader and XmlTextWriter, which allow you to ignore namespaces, at least indent the markup:

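    // Sample from the question, truncated in the middle of a closing tag.
    var xml = @"<?xml encoding=""UTF-8""?><X><B><C>aa</C><B/><A.B><X>bb</X></A.B><A p=""pp""/><nn:A>cc</nn:A><D><E>eee</";

    // IsSuppressingErrors makes the AngleSharp.Xml parser tolerate malformed input.
    var xmlParser = new XmlParser(new XmlParserOptions() { IsSuppressingErrors = true });

    var doc = xmlParser.ParseDocument(xml);

    // Print the markup of the parsed document as AngleSharp serializes it.
    Console.WriteLine(doc.ToMarkup());

    using (StringReader sr = new StringReader(doc.ToXml()))
    using (XmlTextReader xr = new XmlTextReader(sr))
    using (XmlTextWriter xw = new XmlTextWriter(Console.Out))
    {
        // Disable namespace handling because the prefix nn is not declared.
        xr.Namespaces = false;
        xw.Namespaces = false;
        xw.Formatting = Formatting.Indented;

        // Copy every node from the reader to the indenting writer.
        xw.WriteNode(xr, false);
    }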

This gives, for example:

<X>
  <B>
    <C>aa</C>
    <B />
    <A.B>
      <X>bb</X>
    </A.B>
    <A p="pp" />
    <nn:A>cc</nn:A>
    <D>
      <E>eee</E>
    </D>
  </B>
</X>

Your text says "Until this position the XML has a valid syntax", and your comment suggests that the errors in your sample are just due to sloppiness. In that case it should also be possible to call WriteNode on an XmlWriter created with XmlWriterSettings.Indent set to true, passing a standard XmlReader, as long as you catch the exception the XmlReader throws:

        var xml = @"<?xml version=""1.0""?><root><section><p>Paragraph 1.</p><p>Paragraph 2.";

        try
        {
            using (StringReader sr = new StringReader(xml))
            {
                using (XmlReader xr = XmlReader.Create(sr))
                {
                    using (XmlWriter xw = XmlWriter.Create(Console.Out, new XmlWriterSettings() { Indent = true }))
                    {
                        xw.WriteNode(xr, false);
                    }
                }
            }
        }
        catch (XmlException e)
        {
            Console.WriteLine();
            Console.WriteLine("Malformed input XML: {0}", e.Message);
        }

gives

<?xml version="1.0"?>
<root>
  <section>
    <p>Paragraph 1.</p>
    <p>Paragraph 2.</p>
  </section>
</root>
Malformed input XML: Unexpected end of file has occurred. The following elements are not closed: p, section, root. Line 1, position 71.

So with WriteNode there is no need for your own code to handle every possible Readxxx method and node type and call the corresponding Writexxx method on the XmlWriter.

Martin Honnen

Does it have to be C#?

In Java, you should be able to pipe the output of a SAX parser into an indenting serializer by connecting a SAXSource to a StreamResult using an identity transformer, and then just make sure that when the SAX parser aborts, you trap the exception and close the output stream tidily.
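
As a rough illustration of the Java route described above, here is a minimal sketch using only the JDK's built-in JAXP classes. The class and method names (`TruncatedXmlIndenter`, `indentUntilError`) are made up for this example, and it works on strings for brevity. It writes to a `Writer` that the caller owns, on the assumption that this keeps the partial output reachable after the parser aborts.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class TruncatedXmlIndenter {

    // Returns whatever indented output was produced before the parser gave up.
    static String indentUntilError(String xml) {
        StringWriter out = new StringWriter();
        try {
            // newTransformer() without a stylesheet yields the identity transformer.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.INDENT, "yes");

            // The SAX parser pushes its events straight into the indenting serializer.
            SAXSource source = new SAXSource(new InputSource(new StringReader(xml)));
            identity.transform(source, new StreamResult(out));
        } catch (TransformerConfigurationException e) {
            // Setup problem, unrelated to the input document.
            throw new IllegalStateException(e);
        } catch (TransformerException e) {
            // Expected for truncated input: keep what has been written so far.
            System.err.println("Input stopped being well-formed: " + e.getMessage());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String truncated = "<root><section><p>Paragraph 1.</p><p>Paragraph 2.";
        System.out.println(indentUntilError(truncated));
    }
}
```

How much of the final, incomplete node ends up in the output depends on the parser and on what the serializer has buffered at the moment of the error, so the tail may be missing. For inputs in the 10 MB to 100 MB range you would presumably swap the `StringReader` and `StringWriter` for file-based streams.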

I think you can probably do the same thing in C# but not quite as conveniently: coupling the events read from an XmlReader and sending the corresponding events to an XmlWriter is a lot more tedious because you have to write code for each separate kind of event.

If you want a C# solution and you're prepared to install Saxon Enterprise Edition, you can write a simple streaming transformation:

<transform version="3.0" xmlns="http://www.w3.org/1999/XSL/Transform">
  <output method="xml" indent="yes"/>
  <mode streamable="yes" on-no-match="shallow-copy"/>
</transform>

invoke it from the Saxon API using XsltTransformer with a Serializer as the destination, and again, catch the exception and flush/close the output stream to which the Serializer is writing.

Using Saxon on Java would be overkill because the identity transformer does this "out of the box".

Michael Kay
  • I read the requirement at face value: the OP says that the XML has valid syntax until the point where it's truncated: I was assuming he/she wanted to serialize everything up to the first syntax error. I didn't look at the examples too closely. – Michael Kay Jul 28 '21 at 17:51
  • Right, I kind of went by the strange sample and thought the whole syntax might break any XML parser. But you are right, the text suggests the markup is fine other than being incomplete and the latest comment of the poster kind of confirms that the sample is just sloppy. – Martin Honnen Jul 28 '21 at 18:27