4

I'm writing some code that logs XML data, and I would like to be able to replace the content of certain elements (e.g. passwords) in the document. I'd rather not serialize and parse the document, as my code will be handling a variety of schemas.

Sample input documents:

doc #1:

<user>
    <userid>jsmith</userid>
    <password>myPword</password>
</user>

doc #2:

<secinfo>
    <ns:username>jsmith</ns:username>
    <ns:password>myPword</ns:password>
</secinfo>

What I'd like my output to be:

output doc #1:

<user>
    <userid>jsmith</userid>
    <password>XXXXX</password>
</user>

output doc #2:

<secinfo>
    <ns:username>jsmith</ns:username>
    <ns:password>XXXXX</ns:password>
</secinfo>

Since the documents I'll be processing could have a variety of schemas, I was hoping to come up with a nice generic regular expression solution that could find elements whose names contain "password" and mask their content accordingly.

Can I solve this using regular expressions and C# or is there a more efficient way?

Millhouse
  • I would certainly avoid using a regex when there are many other fine tools to do what you want to do. – Robert P Jan 15 '09 at 20:47
  • Even if regexes were capable of this (they aren't), having a variety of schemas makes it *more necessary* to use a parser of some form or another, not less. – annakata Jan 15 '09 at 20:51

7 Answers

21

This problem is best solved with XSLT:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>
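    <!-- local-name() ignores the prefix, so this matches both password and ns:password -->
    <xsl:template match="*[local-name() = 'password']">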
        <xsl:copy>
            <xsl:text>XXXXX</xsl:text>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

This will work for both inputs as long as you handle the namespaces properly.

Edit : Clarification of what I mean by "handle namespaces properly"

Make sure that a source document using the ns prefix declares a namespace for that prefix, like so:

<?xml version="1.0" encoding="utf-8"?>
<secinfo xmlns:ns="urn:foo">
    <ns:username>jsmith</ns:username>
    <ns:password>XXXXX</ns:password>
</secinfo>
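
To apply a stylesheet like this from C#, something along the following lines should work. This is only a sketch: the class and method names (`XsltMasker`, `Mask`) are made up for illustration, and the embedded stylesheet is a variant that matches on `local-name()` so that prefixed elements such as `ns:password` are caught as well.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: runs an identity transform that masks password elements.
public static class XsltMasker
{
    // Assumption: any element whose local name is exactly "password" is masked.
    private const string Stylesheet = @"<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:template match='@* | node()'>
    <xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy>
  </xsl:template>
  <xsl:template match=""*[local-name() = 'password']"">
    <xsl:copy><xsl:text>XXXXX</xsl:text></xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    public static string Mask(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xsltReader);
        }

        var output = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, transform.OutputSettings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        string xml = "<secinfo xmlns:ns='urn:foo'><ns:username>jsmith</ns:username>"
                   + "<ns:password>myPword</ns:password></secinfo>";
        Console.WriteLine(Mask(xml));
    }
}
```

In a logging path you would probably load the stylesheet once and reuse the `XslCompiledTransform` instance rather than compiling it per call.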
Community
  • 1
  • 1
Andrew Hare
9

I'd say you're better off parsing the content with a .NET XmlDocument object, finding the password elements using XPath, and then changing their InnerXml (or InnerText) properties. It has the advantage of being more correct (since XML isn't a regular language in the first place), and it's conceptually easy to understand.
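
A minimal sketch of that idea, assuming the elements to mask are those whose local name is exactly `password`, regardless of namespace prefix. The class and method names (`XPathMasker`, `Mask`) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml;

// Hypothetical helper: masks password elements found via XPath.
public static class XPathMasker
{
    public static string Mask(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        // local-name() compares the element name without its prefix,
        // so no XmlNamespaceManager is needed for this query.
        // The matches are copied into a list before the document is modified.
        var matches = doc.SelectNodes("//*[local-name() = 'password']")
                         .Cast<XmlNode>()
                         .ToList();

        foreach (XmlNode node in matches)
        {
            node.InnerText = "XXXXX";
        }
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Mask("<user><userid>jsmith</userid><password>myPword</password></user>"));
    }
}
```

Setting `InnerText` rather than `InnerXml` means the replacement is always treated as plain text, which is the safer choice if the mask string ever contains markup characters.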

Welbog
8

From experience with systems that try to parse and/or modify XML without proper parsers, let me say: DON'T DO IT. Use an XML parser (There are other answers here that have ways to do that quickly and easily).

Using non-XML methods to parse and/or modify an XML stream will ALWAYS lead you to pain at some point in the future. I know, because I have felt that pain.

I know that it seems like it would be quicker-at-runtime/simpler-to-code/easier-to-understand/whatever if you use the regex solution. But you're just going to make someone's life miserable later.

Michael Kohne
  • You make a good point, I think some of the other proposed solutions here (XSLT, XPATH or XDocument) will save me from some pain in the future. – Millhouse Jan 15 '09 at 21:40
  • There are very few absolute rules that aren't riddled with exceptions. "Don't ever use string manipulation tools to parse or modify XML" is one of them. – Robert Rossney Jan 17 '09 at 02:12
4

You can use regular expressions if you know enough about what you are trying to match. For example, if you are looking for any tag that has the word "password" in it and contains no inner tags, this regex would work:

(<([^>]*?password[^>]*?)>)([^<]*?)(<\/\2>)

You could use the same C# replace statement as in zowat's answer, but with "$1XXXXX$4" as the replacement string.
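
For illustration, here is how that pattern and replacement string might be wired up with `Regex.Replace`. The referenced answer is not shown here, so the call shape is an assumption, and `RegexMasker` and `Mask` are made-up names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: applies the pattern above with the "$1XXXXX$4" replacement.
public static class RegexMasker
{
    // Group 1: opening tag, group 2: tag name, group 3: content, group 4: closing tag.
    private static readonly Regex PasswordElement =
        new Regex(@"(<([^>]*?password[^>]*?)>)([^<]*?)(<\/\2>)");

    public static string Mask(string xml)
    {
        // Keep the opening and closing tags, replace only the content between them.
        return PasswordElement.Replace(xml, "$1XXXXX$4");
    }

    public static void Main()
    {
        Console.WriteLine(Mask("<user><userid>jsmith</userid><password>myPword</password></user>"));
        // prints: <user><userid>jsmith</userid><password>XXXXX</password></user>
    }
}
```

Note the limits of this approach: because group 2 captures everything between `<` and `>`, an opening tag that carries attributes (for example `<password type="plain">`) will not match its closing tag, and content wrapped in CDATA or containing child elements is skipped as well.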

John Conrad
  • No, you cannot (or should not), because regexes don't know hierarchy, cannot load and parse named or numeric entities, let alone external doctypes, have trouble with CDATA and cannot deal with namespaces. Use XSLT instead. You will just open a can of worms trying it this way. – Abel Mar 06 '11 at 16:42
  • The question explicitly mentioned regular expressions. Why is everyone trying to offer an infeasible solution instead? – SPDenver Jul 08 '11 at 22:20
1

The main reason that XSLT exists is to transform XML structures: an XSLT document is a type of stylesheet that can be used to alter the order of elements or change the content of elements. This is therefore a typical situation where it's highly recommended to use XSLT instead of hand-written parsing, as Andrew Hare said in a previous post.

Abel
pelle
1

Regex is the wrong approach for this; I've seen it go badly wrong when you least expect it.

XDocument is way more fun anyway:

// Requires: using System.Xml.Linq;
XDocument doc = XDocument.Parse(@"
            <user>
                <userid>jsmith</userid>
                <password>password</password>
            </user>");

// Navigate from the root element to its <password> child and overwrite its text.
doc.Element("user").Element("password").Value = "XXXXX";

// Temporary namespace, just for the purposes of the example.
XDocument doc2 = XDocument.Parse(@"
            <secinfo xmlns:ns='http://tempuru.org/users'>
                <ns:userid>jsmith</ns:userid>
                <ns:password>password</ns:password>
            </secinfo>");

// A namespaced element is addressed by its expanded name: {namespace-uri}local-name.
doc2.Element("secinfo").Element("{http://tempuru.org/users}password").Value = "XXXXX";
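
The lines above address each element by its exact path. Since the question asks for something schema-independent, a more generic variant could scan all descendants instead. This is a sketch under the assumption that any element whose local name contains "password" (case-insensitively) should be masked; `XDocumentMasker` and `Mask` are made-up names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper: masks every element whose local name contains "password".
public static class XDocumentMasker
{
    public static string Mask(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Collect the matches first so the tree is not modified while it is being queried.
        var matches = doc.Descendants()
            .Where(e => e.Name.LocalName.IndexOf("password", StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();

        foreach (XElement element in matches)
        {
            element.Value = "XXXXX";
        }
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        Console.WriteLine(Mask("<user><userid>jsmith</userid><password>myPword</password></user>"));
    }
}
```

Matching on `Name.LocalName` sidesteps the namespace entirely, so the same call handles both `password` and `ns:password`.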
Kev
1

Here is what I came up with when I went with XmlDocument. It may not be as slick as XSLT, but it should be generic enough to handle a variety of documents:

            //input is a String with some valid XML
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(input);
            XmlNodeList nodeList = doc.SelectNodes("//*");

            foreach (XmlNode node in nodeList)
            {
                if (node.Name.ToUpper().Contains("PASSWORD"))
                {
                    node.InnerText = "XXXX";
                }
                else if (node.Attributes.Count > 0)
                {
                    foreach (XmlAttribute a in node.Attributes)
                    {
                        if (a.LocalName.ToUpper().Contains("PASSWORD"))
                        {
                            a.InnerText = "XXXXX";
                        }
                    }
                }    
            }
Millhouse
  • I think you want to use LocalName for both elements and attributes. Also, if you make this a recursive function that walks the XML tree, you don't have to start out by building a list of all the elements in the document. – Robert Rossney Jan 17 '09 at 02:41
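
A rough sketch of what that comment suggests: a recursive walk that checks `LocalName` on both elements and attributes. The names `RecursiveMasker`, `Mask`, `MaskNode` and `IsSensitive` are hypothetical, and the choice to stop descending once an element has been masked is an assumption of this sketch.

```csharp
using System;
using System.Xml;

// Hypothetical helper: walks the tree recursively instead of building a node list up front.
public static class RecursiveMasker
{
    private static bool IsSensitive(string localName)
    {
        return localName.IndexOf("password", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    private static void MaskNode(XmlNode node)
    {
        if (node.NodeType != XmlNodeType.Element) return;

        // Mask matching attributes on every element, whether or not the element itself matches.
        foreach (XmlAttribute attribute in node.Attributes)
        {
            if (IsSensitive(attribute.LocalName)) attribute.Value = "XXXXX";
        }

        if (IsSensitive(node.LocalName))
        {
            // Replaces all children with a single text node, so there is nothing left to visit.
            node.InnerText = "XXXXX";
            return;
        }

        foreach (XmlNode child in node.ChildNodes)
        {
            MaskNode(child);
        }
    }

    public static string Mask(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        MaskNode(doc.DocumentElement);
        return doc.OuterXml;
    }

    public static void Main()
    {
        Console.WriteLine(Mask("<login password='secret'><user>jsmith</user><userPassword>myPword</userPassword></login>"));
    }
}
```

Using `LocalName` avoids false positives from a prefix that happens to contain "password", and the case-insensitive `IndexOf` avoids the culture-sensitive behaviour of `ToUpper()`.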