13

We have a string field which can contain XML or plain text. The XML contains no <?xml header, and no root element, i.e. is not well formed.

We need to be able to redact XML data, emptying element and attribute values, leaving just their names, so I need to test if this string is XML before it's redacted.

Currently I'm using this approach:

string redact(string eventDetail)
{
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    ...

Is there a better way?

Are there any edge cases this approach could miss?

I appreciate I could use XmlDocument.LoadXml and catch XmlException, but this feels like an expensive option, since I already know that a lot of the data will not be in XML.

Here's an example of the XML data, apart from missing a root element (which is omitted to save space, since there will be a lot of data), we can assume it is well formed:

<TableName FirstField="Foo" SecondField="Bar" /> 
<TableName FirstField="Foo" SecondField="Bar" /> 
...

Currently we are only using attribute based values, but we may use elements in the future if the data becomes more complex.

SOLUTION

Based on multiple comments (thanks guys!)

string redact(string eventDetail)
{
    if (string.IsNullOrEmpty(eventDetail)) return eventDetail; //+1 for unit tests :)
    string detail = eventDetail.Trim();
    if (!detail.StartsWith("<") && !detail.EndsWith(">")) return eventDetail;
    XmlDocument xml = new XmlDocument();
    try
    {
        xml.LoadXml(string.Format("<Root>{0}</Root>", detail));
    }
    catch (XmlException e)
    {
        log.WarnFormat("Data NOT redacted. Caught {0} loading eventDetail {1}", e.Message, eventDetail);
        return eventDetail;
    }
    ... // redact
si618
  • 16,580
  • 12
  • 67
  • 84
  • I would go with the LoadXml, that way you know the "XML" data that was inputted is valid. If you use your method (the code), you can have malformed XML that will pass the test. – Martin Sep 29 '09 at 00:54
  • See http://stackoverflow.com/questions/1072158/validate-xml-syntax-only-in-c – Graviton Sep 29 '09 at 00:59
  • Are you writing this yourself? I don't quite understand why you're writing it in a way you can't interpret the areas correctly, then ... ? – Noon Silk Sep 29 '09 at 01:08
  • Thanks everyone, Ngu and RRUZ found very similar questions so I'll delete this one. – si618 Sep 29 '09 at 01:09
  • @Silky, this data is for our audit log, a lot of the event detail is for things like "Incorrect password", "Incorrect username", "Session timeout" etc. Some of it (the XML) is for lookup data - what did a user search for and what was returned. You're correct in saying that I could use another approach to determine whether to expect XML or not, but it would have to rely on a code, and we keep adding more of these all the time, and it wouldn't be 100% reliable because I can't control what goes into the log, as other programmers also write to it. – si618 Sep 29 '09 at 01:15
  • Still, same rules apply; with each log item I'd write something like `'`. It just seems that if you are writing it, you can put tokens around it in some fashion to make life a lot easier in the processing (and a lot more 'secure', because you have more direct control). – Noon Silk Sep 29 '09 at 01:18
  • Except that the audit log is huge, we're excepting millions of rows, so we want to keep it as compact as possible, and only resorting to XML when absolutely necessary. – si618 Sep 29 '09 at 01:24
  • So you just cycle the logs? Or back them up? I wasn't suggesting putting everything in XML, but even if you did, it wouldn't add that much (``) and have 'dns='1'. Even a simple tabbed format per line would be enough; but anyway, it's your application, and it's lunchtime for me, so I'll leave you to it :) – Noon Silk Sep 29 '09 at 01:28

7 Answers7

8

If you're going to accept not well formed XML in the first place, I think catching the exception is the best way to handle it.

lod3n
  • 2,893
  • 15
  • 16
  • I've updated the question with example data. We can assume it is well formed except for missing root element. – si618 Sep 29 '09 at 01:03
4

One possibility is to mix both solutions. You can use your redact method and try to load it (inside the if). This way, you'll only try to load what is likely to be a well-formed xml, and discard most of the non-xml entries.

Samuel Carrijo
  • 17,449
  • 12
  • 49
  • 59
  • I've marked this as the most appropriate answer, because I think it solves my problem in the most efficient way. For most cases, StartsWith < and EndsWith > will filter out non-xml data, and for rare situations like Ira Baxter describes, catching the XmlException will solve those. – si618 Sep 29 '09 at 04:49
2

If your goal is reliability then the best option is to use XmlDocument.LoadXml to determine if it's valid XML or not. A full parse of the data may be expensive but it's the only way to reliably tell if it's valid XML or not. Otherwise any character you don't examine in the buffer could cause the data to be illegal XML.

JaredPar
  • 733,204
  • 149
  • 1,241
  • 1,454
  • I don't think `XmlDocument` is a good choice here - he doesn't need DOM, merely to validate. Looks like `XmlReader` and `try { while (reader.Read(); } catch(XmlException ex) { ... }` would be a more lightweight approach. – Pavel Minaev Sep 29 '09 at 00:57
  • @Pavel, but I also have to modify the Xml to redact the data, hence the need for XmlDocument. – si618 Sep 29 '09 at 01:08
  • Agreed, but if I combine approaches (as per Samuel's idea), then I should catch 99% of the plain text with the StartsWith and EndsWith code, and leave the other 1% to be caught if LoadXml throws XmlException. – si618 Sep 29 '09 at 01:34
1

Depends on how accurate a test you want. Considering that you already don't have the official <xml, you're already trying to detect something that isn't XML. Ideally you'd parse the text by a full XML parser (as you suggest LoadXML); anything it rejects isn't XML. The question is, do you care if you accept a non-XML string? For instance, are you OK with accepting

  <the quick brown fox jumped over the lazy dog's back>

as XML and stripping it? If so, your technique is fine. If not, you have to decide how tight a test you want and code a recognizer with that degree of tightness.

Ira Baxter
  • 93,541
  • 22
  • 172
  • 341
1

How is the data coming to you? What is the other type of data surrounding it? Perhaps there is a better way; perhaps you can tokenise the data you control, and then infer that anything that is not within those tokens is XML, but we'd need to know more.

Failing a cute solution like that, I think what you have is fine (for validating that it starts and ends with those characters).

We need to know more about the data format really.

Noon Silk
  • 54,084
  • 6
  • 88
  • 105
0

If the XML contains no root element (i.e. it's an XML fragment, not a full document), then the following would be perfectly valid sample, as well - but wouldn't match your detector:

foo<bar/>baz

In fact, any text string would be valid XML fragment (consider if the original XML document was just the root element wrapping some text, and you take the root element tags away)!

Pavel Minaev
  • 99,783
  • 25
  • 219
  • 289
0
try
{
    XmlDocument myDoc = new XmlDocument();
    myDoc.LoadXml(myString);
}
catch(XmlException ex)
{
    //take care of the exception
}
Evgeny
  • 3,320
  • 7
  • 39
  • 50
  • 1
    Of course, and this is stated in the question. But catching exceptions is expensive when I know a lot of the data is not xml. – si618 Sep 29 '09 at 04:50