I have a malformed XML (SOAP) file that I need to parse. The issue is that the XML doesn't have proper header tags.

I've tried to parse the file with XDocument and XmlDocument, but neither worked. The XML starts at line 30, so maybe there is some way to skip those lines before the file is read by the XML parser?

<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:eb="http://www.oasis-open.org/committees/ebxml-msg/schema/msg-header-2_0.xsd">
<SOAP-ENV:Header>
</SOAP-ENV:Header>
<SOAP-ENV:Body>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
<?xml version="1.0" encoding="ISO-8859-1"?>
<?xml-stylesheet type="text/xsl" href="Finvoice.xsl"?>
<GGVersion="2.01" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="a.xsd">

XmlReaderSettings settings = new XmlReaderSettings();
settings.ConformanceLevel = ConformanceLevel.Fragment;
XmlReader r = XmlReader.Create(file.FullName, settings);
XmlDocument xDoc = new XmlDocument();
xDoc.PreserveWhitespace = true;
xDoc.LoadXml("<xml/>");
xDoc.DocumentElement.CreateNavigator().AppendChild(r);

XmlNamespaceManager manager = new XmlNamespaceManager(xDoc.NameTable);

When I try to parse it, I get: Unexpected xml declaration. The xml declaration must be the first node in the document ....

sukkis
  • Is that a new XML declaration inside the SOAP message data? Unless there is a CDATA-tag that you cut from the message, I don't think there's any way to make any XML parser accept this. – gnud May 22 '19 at 19:26
  • No, there is no CDATA tag in the file. I edited the question a bit to give a better idea of the SOAP header. – sukkis May 22 '19 at 19:35
  • There are... [options...](https://stackoverflow.com/a/1732454/424129) – 15ee8f99-57ff-4f92-890c-b56153 May 22 '19 at 19:39
  • @sukkis I still don't quite understand what the XML looks like. Do you mean there is a SOAP header first, and _then_ a full XML document, and you want to throw away the SOAP header? Is there nothing of the SOAP envelope after the XML you are looking for? In that case this seems very solvable with a simple `Substring` method call. – gnud May 22 '19 at 19:54
  • @gnud Exactly, there is only the SOAP header message before the well-formed XML starts. How should I use Substring? I'm quite a novice with C#. – sukkis May 22 '19 at 20:03
  • Does the XML _always_ start at line 30? In that case I might have a cleaner solution. – gnud May 22 '19 at 20:19
  • Yes it does and thanks for your great solution already – sukkis May 22 '19 at 20:26
  • @sukkis I added an alternative method for skipping to line 30. – gnud May 22 '19 at 20:34

1 Answer

If I understand you correctly, the data you are looking for starts after the SOAP envelope, and there is no garbage or unnecessary content after it. The SOAP header does not start with an XML declaration (`<?xml version=`, etc.).

Looking for the start of the document

A simple solution is to find the start of the XML document (the data you are looking for), and chop away everything before that.

var startOfRealDocumentMarker = "<?xml version=\"1.0\"";
var startIndex = dirtyXmlString.IndexOf(startOfRealDocumentMarker);
if(startIndex == -1) {
    throw new Exception("Start of XML not found. Now what?");
}
var cleanXmlString = dirtyXmlString.Substring(startIndex);
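
To show where `dirtyXmlString` might come from and what to do with the result, here is a minimal sketch that reads the file, strips the leading envelope, and parses the rest. The class and method names (`SoapPrefixStripper`, `StripBeforeDeclaration`, `LoadFromFile`) are made up for illustration, and the ISO-8859-1 encoding is an assumption taken from the declaration in the question's sample.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper class; the names are illustrative, not from the original answer.
public static class SoapPrefixStripper
{
    // Returns the input from the first XML declaration onward.
    public static string StripBeforeDeclaration(string dirtyXml)
    {
        const string marker = "<?xml version=\"1.0\"";
        int startIndex = dirtyXml.IndexOf(marker, StringComparison.Ordinal);
        if (startIndex == -1)
        {
            throw new InvalidDataException("Start of XML not found.");
        }
        return dirtyXml.Substring(startIndex);
    }

    // Reads the whole file, drops everything before the declaration, and parses the rest.
    public static XDocument LoadFromFile(string path)
    {
        // Assumption: the file is ISO-8859-1, as the sample's declaration states.
        string dirtyXml = File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
        return XDocument.Parse(StripBeforeDeclaration(dirtyXml));
    }
}
```

Calling `SoapPrefixStripper.LoadFromFile(file.FullName)` would then give an `XDocument` whose root is the element that follows the declaration. Reading the whole file into memory is fine for typical invoice-sized documents, but may not suit very large files.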

If the SOAP header also has an XML declaration, you could look for the end-tag of the SOAP envelope instead. Or you could start looking for the declaration at the 2nd character, so you would skip over the first one.
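
Both alternatives can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and the `SOAP-ENV` prefix in the end tag is an assumption based on the sample in the question (other senders may use a different prefix).

```csharp
using System;

// Hypothetical helpers illustrating the two alternatives; names are made up.
public static class SoapEnvelopeSkipper
{
    // Alternative 1: keep only what follows the closing tag of the SOAP envelope.
    public static string AfterEnvelope(string dirtyXml)
    {
        // Assumption: the envelope uses the SOAP-ENV prefix, as in the sample.
        const string endTag = "</SOAP-ENV:Envelope>";
        int endIndex = dirtyXml.IndexOf(endTag, StringComparison.Ordinal);
        if (endIndex == -1)
        {
            throw new FormatException("End of SOAP envelope not found.");
        }
        // Trim leading whitespace, because the XML declaration must come first.
        return dirtyXml.Substring(endIndex + endTag.Length).TrimStart();
    }

    // Alternative 2: search from the 2nd character, so a declaration at index 0 is skipped.
    // This only helps if the first declaration really is at the very start of the string.
    public static string FromSecondDeclaration(string dirtyXml)
    {
        const string marker = "<?xml version=";
        if (dirtyXml.Length < 2)
        {
            throw new FormatException("Input is too short to contain a second declaration.");
        }
        int startIndex = dirtyXml.IndexOf(marker, 1, StringComparison.Ordinal);
        if (startIndex == -1)
        {
            throw new FormatException("Second XML declaration not found.");
        }
        return dirtyXml.Substring(startIndex);
    }
}
```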

This is obviously not a fool-proof solution that will work in every case. But maybe it will work in all of your cases?

Skipping lines

If you're sure it will work to always start reading from line 30 of the input file, you can use this method instead.

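// Requires: using System.IO; using System.Text; using System.Xml;
XmlDocument xDoc = new XmlDocument();
// When loading from a TextReader, the characters are decoded by the reader,
// not according to the XML declaration, so pass the file's encoding
// (ISO-8859-1 in the question's sample) explicitly.
using (var rdr = new StreamReader(pathToXmlFile, Encoding.GetEncoding("ISO-8859-1")))
{
    // Skip until reader is positioned at start of line 30
    for (var i = 0; i < 29; ++i)
    {
        rdr.ReadLine();
    }
    // Load document from current position of reader
    xDoc.Load(rdr);
}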
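
If the line number turns out not to be fixed after all, the two ideas can be combined: read lines until one begins with the XML declaration, then parse from there. This is a variation of my own rather than something the question requires, it assumes the declaration starts on its own line, and the names `DeclarationLineScanner` and `LoadFromDeclarationLine` are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml.Linq;

// Hypothetical helper; not part of the original answer.
public static class DeclarationLineScanner
{
    // Skips lines until one starts with an XML declaration, then parses from that line on.
    public static XDocument LoadFromDeclarationLine(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string trimmed = line.TrimStart();
            if (trimmed.StartsWith("<?xml version=", StringComparison.Ordinal))
            {
                // ReadLine consumed the declaration line, so put it back in front of the rest.
                return XDocument.Parse(trimmed + "\n" + reader.ReadToEnd());
            }
        }
        throw new InvalidDataException("No line starting with an XML declaration was found.");
    }
}
```

It can be called with a `StreamReader` opened on the file, using whatever encoding the file is actually stored in.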
gnud