Parsing XML with spaces in element names

Question

So I have to parse a simple XML file (there is only one level, no attributes, just elements and values), but the problem is that there are (or could be) spaces in the element names. I know that's bad (possibly terrible) practice, but I'm not the one building the XML; it comes from an external library.

example:

<live key>test</live key>
<not live>test</not live>
<Test>hello</Test>

Right now my strategy is to read the XML (I have it as a string) one character at a time and just save each element name and value as I get to it, but that seems a bit too complicated.

Is there an easier way to do it? XmlReader throws an error because it assumes the XML is well-formed: it treats "live" as the element name and "key" as an attribute, so it looks for an "=" and finds a ">" instead.

Personally I'd try to remove or replace all the spaces then load the xml. But that too could be tricky. — juharr, Oct 08 '14 at 15:57
I'd send a strongly worded letter to whoever manages this library — Jonesopolis, Oct 08 '14 at 16:00
Unfortunately, spaces make the input not well-formed XML, meaning that no standard parser is going to take it; essentially, you are on your own. This is terrible - try convincing the writers of your third-party library to fix this. If they are still around, they should understand why. — Sergey Kalinichenko, Oct 08 '14 at 16:00
Do you have a list of all tags that may have spaces in them, or is that list dynamic? — Sergey Kalinichenko, Oct 08 '14 at 16:03
@dasblinkenlight it's dynamic. And to everyone else, yeah, I think I'm going to ask the person who wrote the library if they can include a JSON option in addition to XML. That would save a lot of trouble. — Embattled Swag, Oct 08 '14 at 16:05
Why do you refer to this as XML? It is nothing of the kind. If your data supplier wants to invent a custom non-standard variant of XML, someone will need to write parsers for it. That's a lot of effort, I can't see why anyone would want to do that. — Michael Kay, Oct 08 '14 at 16:36

score 3 · Accepted Answer · answered Oct 08 '14 at 16:17

Unfortunately, the text returned by your library is not well-formed XML, so you cannot use an XML parser to parse it. The spaces in the tags are only part of the problem; there are other issues, for example the absence of a single root element.

Fortunately, a single-level language is trivial enough to be matched with regular expressions. Regex-based "parsers" would be an awful choice for real XML, but this language is not real, so you could use regex at least as a workaround:

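// Requires: using System; using System.Text.RegularExpressions;
// Group 1 is the tag name (spaces allowed), group 2 is the value;
// the backreference \1 forces the closing tag to repeat the opening name.
Regex rx = new Regex("<([^>\n]*)>(.*?)</(\\1)>");
var m = rx.Match(text); // text holds the string returned by the library
while (m.Success) {
    Console.WriteLine("{0}='{1}'", m.Groups[1], m.Groups[2]);
    m = m.NextMatch();
}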

The idea behind this approach is to find strings with "opening tags" that match "closing tags" with a slash.

Here is a demo; it produces the following output for your input:

live key='test'
not live='test'
Test='hello'
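
If the pairs are needed as data rather than printed, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions that go beyond the answer: the names FlatTagParser and Parse are made up for illustration, tag names are assumed to be non-empty and not to start with a slash, and RegexOptions.Singleline is added so that a value may span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the original answer or of any library.
public static class FlatTagParser
{
    // Group 1: tag name (no '<', '>' or newline, and not starting with '/').
    // Group 2: the value, matched lazily up to the first matching closing tag.
    // The backreference \1 requires the closing tag to repeat the opening name.
    private static readonly Regex Pair =
        new Regex(@"<([^<>/\n][^<>\n]*)>(.*?)</\1>", RegexOptions.Singleline);

    // Returns the name/value pairs in document order; duplicate names are kept.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pair.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value));
        }
        return result;
    }
}
```

Calling FlatTagParser.Parse on the sample from the question should yield the same three pairs as the output above. A list is used instead of a dictionary because nothing in the question guarantees that element names are unique.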

thanks, this was pretty helpful — Embattled Swag, Oct 08 '14 at 17:54

score 2 · Answer 2 · answered Oct 08 '14 at 16:16

Since it is a flat structure, maybe this could help:

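    // Requires: using System.Diagnostics; using System.Text.RegularExpressions;
    // Group 1: tag name (word characters and spaces); group 2: value;
    // \1 makes the closing tag repeat the opening name.
    MatchCollection ms = Regex.Matches(xml, @"\<([\w ]+?)\>(.*?)\<\/\1\>");

    foreach (Match m in ms)
    {
        Trace.WriteLine(string.Format("{0} - {1}", m.Groups[1].Value, m.Groups[2].Value));
    }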

This gives you a list of key-value pairs. The Trace calls are only there for checking the results.
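
Once the pairs are extracted, they could also be turned back into well-formed XML so that the usual XML APIs work downstream. The following is a rough sketch, not a definitive fix: FlatTagRepair, ToXDocument and the wrapper element name "root" are made up for illustration, and values are assumed to be plain text (any entity references in the input would be treated literally). It relies on XmlConvert.EncodeLocalName, which encodes characters that are not allowed in XML names, so a space becomes _x0020_.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper; the class, method and "root" names are illustrative only.
public static class FlatTagRepair
{
    // Rebuilds the flat input as a well-formed document under one root element.
    public static XDocument ToXDocument(string xml)
    {
        var root = new XElement("root");
        foreach (Match m in Regex.Matches(xml, @"<([\w ]+?)>(.*?)</\1>"))
        {
            // "live key" becomes "live_x0020_key", which is a valid XML name.
            string name = XmlConvert.EncodeLocalName(m.Groups[1].Value);
            // The captured value is added as text content, so it is escaped on output.
            root.Add(new XElement(name, m.Groups[2].Value));
        }
        return new XDocument(root);
    }
}
```

The original names can be recovered with XmlConvert.DecodeName, for example when reading element.Name.LocalName back out of the document.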
