7

I have a string with XML data that I pulled from a web service. The data is ugly and has some invalid chars in the name tags of the XML. For example, I may see something like:

<Author>Scott the Coder</Author><Address#>My address</Address#>

The # in the Address name field is invalid. I am looking for a regular expression that will remove all the invalid chars from the name tags BUT leave all the chars in the value section of the XML. In other words, I want to use RegEx to remove chars only from the opening name tags and closing name tags. Everything else should remain the same.

I don't have all the invalid chars yet, but this will get me started: #{}&()

Is it possible to do what I am trying to do?

Scott
  • It's a good idea to avoid referring to such things as "XML data". It's not XML. That's why you're having trouble with it. You need to make the supplier of the data aware that their output is junk. – Michael Kay Jan 24 '11 at 09:49
  • Ya, that's what I need to do. No reason to try and simplify things on this message board while working out an issue. I should just hunt down the guy that did it and tell him he's a bad boy. That will solve my problem.... er, wait, no.. I still have the same problem... Next! – Scott Jan 25 '11 at 04:52
  • You might want to add `$` to disallowed characters. – TinyTimZamboni Aug 11 '14 at 21:44

5 Answers

5

If your intention is only to check the validity of a name for an XML node, I suggest taking a look at the XmlConvert class, especially the VerifyName and VerifyNCName methods.

Also note that with that class, you could accept any text as a node name by using the EncodeName and EncodeLocalName methods.

Using those methods will be far easier, safer, and faster than using a regular expression.
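
To make the suggestion above concrete, here is a minimal sketch of how those XmlConvert methods might be wrapped. The class name `ElementNameCheck` and its two helper methods are hypothetical names chosen for illustration; only `XmlConvert.VerifyName` and `XmlConvert.EncodeLocalName` come from the framework. Note that this checks or encodes a single name that you have already extracted; it does not by itself locate tag names inside a larger string.

```csharp
using System;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class ElementNameCheck
{
    // Returns true when 'name' is a legal XML name.
    // XmlConvert.VerifyName throws XmlException for an invalid name
    // and ArgumentNullException for a null or empty one.
    public static bool IsValidName(string name)
    {
        try
        {
            XmlConvert.VerifyName(name);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
        catch (ArgumentNullException)
        {
            return false;
        }
    }

    // Escapes characters that are not allowed in an XML name
    // (each one becomes _xHHHH_) instead of removing them.
    public static string ToSafeName(string name)
    {
        return XmlConvert.EncodeLocalName(name);
    }
}
```

With this sketch, `ElementNameCheck.IsValidName("Address#")` is false, and `ElementNameCheck.ToSafeName("Address#")` yields an escaped name that `XmlConvert.DecodeName` can turn back into the original text.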

Sam B
2

You can use a string replace to remove all invalid characters. Usually it is the ASCII control characters that cause problems when reading XML.

To avoid that, use this function:

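    public static string CleanInvalidXmlChars(this string text)
    {
        // From xml spec valid chars:
        // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        // any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
        // .NET's \x escape takes exactly two hex digits, so \uXXXX is used here.
        // Code points above #xFFFF are stored as surrogate pairs in .NET strings,
        // so valid pairs are kept and only unpaired surrogates are removed.
        string re = @"[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFE\uFFFF]"
                  + @"|[\uD800-\uDBFF](?![\uDC00-\uDFFF])"
                  + @"|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
        return Regex.Replace(text, re, "");
    }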


     xmlcontent = xmlcontent.CleanInvalidXmlChars();

This removes the characters matched by the regular expression. I got this from this site.
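
As an alternative to a hand-written pattern, the same filtering can be sketched with `XmlConvert.IsXmlChar`, which tests a single UTF-16 character against the XML character range. This is only a sketch under assumptions: the class name `XmlCharFilter` and the method name `StripInvalidXmlChars` are hypothetical, and surrogate pairs are handled explicitly so that characters above #xFFFF are kept while unpaired surrogates are dropped.

```csharp
using System.Text;
using System.Xml;

// Hypothetical helper class, for illustration only.
public static class XmlCharFilter
{
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes a code point in
                // #x10000-#x10FFFF, which XML allows: keep both halves.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c) && XmlConvert.IsXmlChar(c))
            {
                // Ordinary character that is legal in XML content.
                sb.Append(c);
            }
            // Anything else (control characters, unpaired surrogates,
            // #xFFFE, #xFFFF) is skipped.
        }
        return sb.ToString();
    }
}
```

Like the function above, this only removes characters that are illegal anywhere in XML; it does not touch characters such as # that are legal in values but illegal in element names.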

sudhansu63
  • I think this regex is missing "\" before "x10FFFF". It will not strip out \x10 for example – d_z Aug 19 '16 at 01:50
1

I had a simple form with two text areas and one button. This seems to do the trick.

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace WindowsFormsApplication3
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
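            // Remove one invalid character that sits directly before the closing ">"
            // of an opening or closing tag, e.g. <Address#> or </Address#>.
            Regex r = new Regex(@"(?<=</?\w+)[#{}()&](?=>)");
            textBox2.Text = r.Replace(textBox1.Text, new MatchEvaluator(deleteMatch));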
        }

        string deleteMatch(Match m) { return ""; }
    }
}
Marco
  • I am trying to avoid searching the string more than once since the string could be huge. However, if I can't find a clean RegEx way to do it, I'll have to spend some time writing a parser that does just that. – Scott Jan 24 '11 at 05:22
  • I understand better now. This seems like something that would help: http://www.perlmonks.org/?node_id=518444 (I mean look ahead and look behind, not the perl part). Ok found them for c# regexp: (?=...) A positive lookahead (?!...) A negative lookahead (?<=...) A positive lookbehind . (?<!...) A negative lookbehind . – Marco Jan 24 '11 at 06:05
1

RegEx is a problematic way to go unless you really only have one file to process. Pain, frustration, bugs is your future there...

If you really want to use a RegEx, there are useful ones HERE that I have used in Perl.

Have you considered using a parser instead?

Two to consider:

LINQ for XML

XmlDocument

Once parsed, you can re-save the troublesome sections or just go on your programmatic way.

dawg
  • I'm not sure whether these characters are valid for tag names or not, but if they are not you may not be able to parse the xml (in fact, that may be what led to this question). If you can parse it, you don't really have to fix it. It is worth trying different parsers, though. – Kobi Jan 24 '11 at 05:31
  • Actually, XMLDocument is where my issue is. XMLDocument throws when xmlDoc.LoadXml(xmlString). I need to fix it before running it through the parser. Unless there is something about XMLDocument that I don't know, I can't use it this way?? – Scott Jan 24 '11 at 05:32
  • @Kobi All these characters are invalid in element names. No conforming XML parser will accept this input. – Michael Kay Jan 24 '11 at 09:47
1

Try this:

s = Regex.Replace(s, @"[#{}&()]+(?=[^<>]*>)", "");

If the lookahead succeeds, the next angle bracket after the match is a right-pointing one (>), which indicates that the match occurred inside a tag.

Of course, this assumes the text is reasonably well-formed and that it contains no angle brackets aside from the ones in the tags.
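
For illustration, the pattern above can be wrapped in a small helper and tried against the sample from the question. The class name `TagNameCleaner` and method name `Clean` are hypothetical; the pattern itself is the one shown above, unchanged. Keep in mind that it strips the listed characters anywhere between < and >, so it would also alter attribute values if the data had any.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class TagNameCleaner
{
    // One or more of the listed characters, but only when the next
    // angle bracket after them is ">", i.e. when they sit inside a tag.
    private static readonly Regex InvalidInTag =
        new Regex(@"[#{}&()]+(?=[^<>]*>)");

    public static string Clean(string s)
    {
        return InvalidInTag.Replace(s, "");
    }
}
```

Calling `TagNameCleaner.Clean("<Address#>My address</Address#>")` returns `<Address>My address</Address>`, while a # that appears in the value text between the tags is left alone because the next angle bracket after it is a <.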

Alan Moore