Fixing bad XML file (eg. unescaped & etc.)

Question

I got an XML file from a third party that I must import into my app. The file has elements with unescaped & characters in their inner text, and they don't want to fix that! So my question is: what is the best way to deal with this problem?

The XML is pretty big and the fix has to be fast. My first solution is simply to replace the & character with &amp;, but I really don't like this "solution", for obvious reasons. I don't know how to use XmlStringReader with such XML because it throws an exception on such lines, so I can't use HtmlEncode on the inner text. I tried setting Settings.CheckCharacters to false on XmlTextReader, but with no result.

Here is a sample. The & is in the naziv element, and that field can contain anything that can appear in a company name, so my replace fix may not work for some other company name. I would like to use HtmlEncode somehow, but only on the inner text, of course.

<komitent ID="001398">
  <sifra>001398</sifra>
  <redni_broj>001398</redni_broj>
  <naziv>LJUBICA & ŽARKO</naziv>
  <adresa1>Odvrtnica 27</adresa1>
  <adresa2></adresa2>
  <drzava>HRVATSKA</drzava>
  <grad>Zagreb</grad>
</komitent>

Switch the 3rd party :-) Honestly if this party is not capable of providing a valid XML I would strongly reconsider using it. — Darin Dimitrov, May 16 '11 at 14:31
@Darin, I would *really*, *really* like to do that, but unfortunately that is not an option :( — Antonio Bakula, May 16 '11 at 14:38
@Antonio Bakula, in this case your best bet is string/replace hoping that you have covered all the possible cases of where this XML could be broken. I mean if the XML is not valid you cannot possibly know where it can be broken so you cannot rely on a XML parser. Today it's a broken ampersand, tomorrow it's a missing closing `>` and the day after a missing closing tag. You see my point? The best way to fix something broken is to not break it in the first place. — Darin Dimitrov, May 16 '11 at 14:40
@Darin Dimitrov, I am perfectly aware of that, thanks :) But was hoping there is a better solution — Antonio Bakula, May 16 '11 at 14:49
@Antonio Bakula, what you have to understand is that **you don't have an XML file**. You have a plain text file. So if the format of this file is not defined you will need to manually parse it. That's why people created formats like XML and defined standards for them. So if the 3rd party cannot provide you with an XML file, at least ask them to define the format of the text file they are providing you so that the parser that you will have to write is as reliable as possible or ask them to provide you with a parser for this custom format. — Darin Dimitrov, May 16 '11 at 14:52
@Darin Dimitrov, they gave me other option, connect to their database with ODBC :( — Antonio Bakula, May 16 '11 at 15:02
@Antonio Bakula, what's their database? Can't you connect directly through a native ADO.NET provider? That smells of a mainframe and would probably explain the lousy XML :-). — Darin Dimitrov, May 16 '11 at 15:03
Of course, the 3rd party cannot fix the XML: that will break all the workarounds implemented by their customers! — Álvaro González, May 16 '11 at 15:14
@Darin Dimitrov, it's a Clarion TopSpeed database. I don't want to go that way because there is absolutely no security: when connected with ODBC my code can write to the DB (!!) And I don't want to know their messed up schema, and maybe be blamed for their errors ;) It's a long story, I will push this "fix" and hopefully forget all about that. — Antonio Bakula, May 16 '11 at 15:29
@DarinDimitrov We all know what the xml parser *should* do: treat it like an xml file, with the one exception: if you encounter a `&` where it is not valid to have an unescaped `&`, and the `&` together with the characters that follow it do not form a valid entity reference, convert the `&` into `&amp;`. The only problem is that to fix that I would have to re-implement 99.9999% of an xml parser, including doctypes, encoding, elements, attributes, entities, cdata, whitespace, prefixes, namespaces, all to add the 0.000001% code. Ideally someone's already done the hard work of processing xml. — Ian Boyd, Dec 03 '17 at 19:31

Paul Butcher · Accepted Answer · 2011-05-16T15:58:34.923

The key message below is that unless you know the exact format of the input file, and have guarantees that any deviation from XML is consistent, you can't fix it programmatically without risking that your fixes will be incorrect.

Fixing it by replacing & with &amp; is an acceptable solution if and only if:

- There is no acceptable well-formed source of these data.
  - As @Darin Dimitrov comments, try to find a better provider, or get this provider to fix it.
  - JSON (for example) is preferable to poorly formed XML, even if you aren't using JavaScript.
- This is a one-off (or at least extremely infrequent) import.
  - If you have to fetch this at runtime, then this solution will not work.
- You can keep iterating, devising new fixes and adding a solution to each problem as you come across it.
  - You will probably find that once you have "fixed" it by escaping & characters, there will be other errors.
- You have the resources to manually check the integrity of the "fixed" data.
  - The errors you "fix" may be more subtle than you realise.
- There are no correctly formatted entities in the document.
  - Simply replacing & with &amp; will erroneously change &quot; to &amp;quot;. You may be able to get around this, but don't be naive about how tricky it might be (entities may be defined in a DTD, may refer to a Unicode code point ...)

If it is a particular element that misbehaves, you could consider wrapping the content of the element with <![CDATA[ ]]>, but that still relies on you being able to find the start and end tags reliably.
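
One way to get around the &quot; problem mentioned above, as a minimal sketch only: escape an & only when it is not already the start of something that looks like an entity or character reference. The class and method names here are hypothetical, and the pattern assumes ASCII entity names. It cannot tell whether a named entity is actually defined (in a DTD, for example), and it will leave text such as "A&B;" untouched, so the caveats in the list still apply.

```csharp
using System.Text.RegularExpressions;

public static class AmpersandFixer
{
    // Matches an '&' that is NOT followed by "name;", "#123;" or "#x1F;".
    // Assumption: entity names are ASCII; this is not a full XML Name check.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:[A-Za-z_:][\w.:\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
        RegexOptions.Compiled);

    // Replaces every bare '&' with "&amp;" and leaves existing references alone.
    public static string EscapeBareAmpersands(string text)
    {
        return BareAmpersand.Replace(text, "&amp;");
    }
}
```

For a large file this could be applied line by line while copying to a new file, rather than loading everything into one string.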

score 2 · Answer 2 · answered Jan 14 '14 at 14:46

If you know the tags of the file and want to "okay" the text inside the tags that could have suspect data, you could do something like this:

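private static string FixBadXmlText(string xmlText)
{
    // Elements whose text may contain unescaped characters such as & or <.
    var unreliableTextTags = new[] { "message", "otherdata", "stacktrace", "innerexception" };

    foreach (var tag in unreliableTextTags)
    {
        // Only attribute-less tags are matched, e.g. <message> but not <message id="1">.
        string openTag = "<" + tag + ">";
        string closeTag = "</" + tag + ">";
        // Wrap the element text in a CDATA section so the parser reads it literally.
        xmlText = xmlText.Replace(openTag, openTag + "<![CDATA[").Replace(closeTag, "]]>" + closeTag);
    }

    return xmlText;
}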

Anything inside a CDATA section (<![CDATA[ {your text here} ]]>) will not be interpreted by an XML parser, so it doesn't need to be escaped. This helped me when I wanted to parse some poorly made XML that didn't properly escape its input.
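
One caveat with the CDATA approach is that the wrapped text must not itself contain "]]>", since that would end the section early. The following is a rough sketch, not a drop-in fix: it wraps a single element name using a regular expression, splits any "]]>" across two CDATA sections, and then parses the result with XDocument. The names CdataWrapper, WrapInCData and ReadNaziv are made up for illustration, and it assumes the element has no attributes and is not nested inside itself.

```csharp
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CdataWrapper
{
    // Wraps the text of every attribute-less <tag>...</tag> element in CDATA.
    public static string WrapInCData(string xmlText, string tag)
    {
        string name = Regex.Escape(tag);
        // Singleline lets '.' match newlines, so multi-line text is covered too.
        var element = new Regex("<" + name + ">(.*?)</" + name + ">", RegexOptions.Singleline);
        return element.Replace(xmlText, m =>
            "<" + tag + "><![CDATA["
            // A literal "]]>" would close the section, so split it in two.
            + m.Groups[1].Value.Replace("]]>", "]]]]><![CDATA[>")
            + "]]></" + tag + ">");
    }

    // Example use: repair the naziv element, then read it with a real XML parser.
    public static string ReadNaziv(string brokenXml)
    {
        string repaired = WrapInCData(brokenXml, "naziv");
        return XDocument.Parse(repaired).Root.Element("naziv").Value;
    }
}
```

With the sample from the question, ReadNaziv would be given the whole komitent element as a string and return the company name with the & intact.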

score 2 · Answer 3 · answered May 16 '11 at 21:47

Start by changing your mindset. The input is not XML, so don't call it XML. Don't even use "xml" to tag your questions about it. The fact that it isn't XML means that you can't use any XML tools with it, and you can't get any of the benefits of XML data interchange. You're dealing with a proprietary format that comes without a specification and without any tools. Treat it as you would any other proprietary format - try to discover a specification for what you are getting, and write a parser for it.

Michael Kay

Can you write, or suggest, the parser that does exactly what we all know it should do? If the `&` is in a location where it is invalid to have an unescaped `&`, and the `&` does not lead to a valid entity reference, replace the `&` with `&amp;`. – Ian Boyd Dec 03 '17 at 19:26
I could write such a parser but I'm not going to. I happen to think that XML made the right decision by saying invalid escape sequences should be rejected. Any laxness in validating incoming XML just encourages more laxness among people generating XML, and you'll soon have people who don't bother escaping special characters at all, which means you'll get even more confusion when you encounter a valid escape sequence and don't know whether it needs to be unescaped or not. – Michael Kay Dec 04 '17 at 09:19
It is fine to be strict. But now I, like the person who asked this question, am left with a problem. No text editor on the planet can open a 4.2GB file. And even if it could, with millions of invalid entries, I have a programming problem of how to fix it. Surely a question worthy of stackoverflow. – Ian Boyd Dec 04 '17 at 14:33
Sure it's a worthy question and I have tried to give a worthy answer. – Michael Kay Dec 04 '17 at 15:17

score 0 · Answer 4 · answered Apr 26 '15 at 15:18

You can handle the file as XPL and even use the XPL parser to transform such files into valid XML. XPL (eXtensible Process Language) is just like XML but the parser allows XML's "special characters" in text fields. So, you can in fact run an invalid XML file (invalid due to special characters) through the XPL process. In some cases, you can use the XPL processor instead of an XML processor. You can also use it to preprocess the invalid files without any performance loss. See: Artificial Intelligence, XML, and Java Concurrency.

score 0 · Answer 5 · answered May 16 '11 at 15:04

Since your starting XML is erroneous you can't use any XmlReaders because they can't read it correctly.

If only the values of the XML nodes aren't HTML-encoded, then you'd have to manually read each line, parse it (get the XML node name and its value), encode the value, and output the result to a new file.

We often end up in a similar situation, so I understand your pain. Most of the time, though, the errors follow some "rule", so I'm guessing here that they didn't encode the business name (and maybe the street name). You can then just search for the string <naziv> and its closing tag </naziv>, and HtmlEncode everything in between. Also, since it's a business name, it won't have line breaks, which can make your life quite a bit easier...
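
The line-by-line approach described above might look something like the sketch below. It is only an illustration under stated assumptions: NazivFixer and EscapeNaziv are hypothetical names, the <naziv> element is assumed to open and close on one line with no attributes, and its text is assumed to contain no already-escaped entities (otherwise &amp; would become &amp;amp;). It uses SecurityElement.Escape rather than HtmlEncode because that method escapes exactly the five XML special characters.

```csharp
using System;
using System.IO;
using System.Security;

public static class NazivFixer
{
    // Copies input to output line by line, escaping the inner text of
    // <naziv> elements that start and end on the same line.
    public static void EscapeNaziv(TextReader input, TextWriter output)
    {
        const string open = "<naziv>";
        const string close = "</naziv>";
        string line;
        while ((line = input.ReadLine()) != null)
        {
            int start = line.IndexOf(open, StringComparison.Ordinal);
            int end = line.LastIndexOf(close, StringComparison.Ordinal);
            if (start >= 0 && end >= start + open.Length)
            {
                int textStart = start + open.Length;
                string inner = line.Substring(textStart, end - textStart);
                // Escapes <, >, &, " and ' in the element text only.
                line = line.Substring(0, textStart)
                     + SecurityElement.Escape(inner)
                     + line.Substring(end);
            }
            output.WriteLine(line);
        }
    }
}
```

Because it works on a TextReader and TextWriter, the same method could be pointed at a StreamReader and StreamWriter so that a large file is never held in memory as a whole.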

score 0 · Answer 6 · answered May 16 '11 at 18:30

You could try something with regular expressions depending on how complex the structure is:

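using System;
using System.Text.RegularExpressions;

// Singleline lets '.' match newlines, so one <komitent> block can span several lines.
// 'test' is assumed to hold the file content as a string.
Regex mainSplitter = new Regex("<komitent ID=\"([0-9]*)\">(.*?)</komitent>", RegexOptions.Singleline);
Regex nazivFinder = new Regex("<naziv>(.*?)</naziv>", RegexOptions.Singleline);

foreach (Match item in mainSplitter.Matches(test))
{
    Console.WriteLine(item);

    string naziv = null;

    // Regex.Match never returns null, so check Success instead.
    Match node = nazivFinder.Match(item.Groups[2].Value);
    if (node.Success)
        naziv = node.Groups[1].Value;
}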
