1

I would like to match the numbers 123456789 and 012 using only one regex in the following strings. I am not sure how to handle all the following scenarios with a single regex:

<one><num>123456789</num><code>012</code></one>

<two><code>012</code><num>123456789</num></two>

<three num="123456789" code="012" />

<four code="012" num="123456789" />

<five code="012"><num>123456789</num></five>

<six num="123456789"><code>012</code></six>

They also don't have to be on the same line like above, for example:

<seven>
<num>123456789</num>
<code>012</code>
</seven>
Xaisoft
  • *Why* in the heavens are you needing to use a Regex for this, when Linq to XML would be so much easier? – Andrew Barber Jan 27 '12 at 07:06
  • @AndrewBarber - The simple answer is because I want to use Regex, but if you can also show me in Linq to XML, I would love to see it. – Xaisoft Jan 27 '12 at 07:14
  • As I understand you want to find all numbers from num and code tags and as attributes with the same name? – FLCL Jan 27 '12 at 07:18
  • @Aleksey - I want to match just 123456789 and 012, with nothing before, after, or in between. – Xaisoft Jan 27 '12 at 07:20
  • @Aleksey - I used those numbers as examples, but they could be different and there might be other parts of the xml that contain 3 digit characters that I do not want to match. – Xaisoft Jan 27 '12 at 07:29
  • @Xaisoft - that's a new requirement (that there are other 3 digit numbers you don't want to match). Can you take a few minutes, create some more complete examples, and try to clearly specify your criteria? – Damien_The_Unbeliever Jan 27 '12 at 07:47
  • Apparently @Xaisoft is after the value of a "num" attribute or node, and the value of a "code" attribute or node. Sounds like an XML-aware approach is way better than a regex approach. – Hans Kesting Jan 27 '12 at 07:56
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Dominik Jan 27 '12 at 08:40

3 Answers

1

At a more abstract level, the problem is to read a value named num or code that may be stored either as an attribute or as a child node. C# already has libraries for parsing XML documents (and, according to your comments, such solutions are acceptable), so it is more natural to use them. The following function returns the value of the specified attribute or child node.

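    // Returns the value of the attribute or child node with the given name.
    static string ParseNode(XmlElement e, string attributeOrNodeName)
    {
        // Prefer the attribute form, e.g. <node num="123456789" />.
        if (e.HasAttribute(attributeOrNodeName))
        {
            return e.GetAttribute(attributeOrNodeName);
        }
        // Otherwise look for a child element, e.g. <node><num>123456789</num></node>.
        var node = e[attributeOrNodeName];
        if (node != null)
        {
            return node.InnerText;
        }
        throw new Exception("The input element doesn't have the specified attribute or node.");
    }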

Test code might look like this (it needs using directives for System, System.Linq and System.Xml):

 var doc = new XmlDocument();
 var xmlString = "<test><node><num>123456789</num><code>012</code></node>\r\n"
     + "<node><code>012</code><num>123456789</num></node>\r\n"
     + "<node num=\"123456789\" code=\"012\" />\r\n"
     + "<node code=\"012\" num=\"123456789\" />\r\n"
     + "<node code=\"012\"><num>123456789</num></node>\r\n"
     + "<node num=\"123456789\"><code>012</code></node>\r\n"
     + @"<node>
         <num>123456789</num>
         <code>012</code>
         </node>
         </test>";
 doc.LoadXml(xmlString);
 foreach (var num in doc.DocumentElement.ChildNodes.Cast<XmlElement>().Select(x => ParseNode(x, "num")))
 {
     Console.WriteLine(num);
 }
 Console.WriteLine();
 foreach (var code in doc.DocumentElement.ChildNodes.Cast<XmlElement>().Select(x => ParseNode(x, "code")))
 {
     Console.WriteLine(code);
 }

In my environment (.NET 4), the code captures all the num and code values.

grapeot
1

Parsing XML with regex is not a good idea. You can use XPath or LINQ to XML (XLinq) instead; LINQ to XML is easier. You must reference System.Xml.Linq and System.Xml and add the matching using declarations. I wrote this code here rather than in Visual Studio, so there may be minor bugs...

// Requires: using System; using System.Xml.Linq;
// var xml = ** load xml string
var document = XDocument.Parse(xml);

foreach (var i in document.Root.Elements())
{
    // Check the attribute first, then fall back to the child element, so mixed
    // cases such as <five code="012"><num>123456789</num></five> also work.
    // Casting a null XAttribute or XElement to string yields null.
    var num = (string)i.Attribute("num") ?? (string)i.Element("num");
    var code = (string)i.Attribute("code") ?? (string)i.Element("code");

    Console.WriteLine("Num: {0}", num);
    Console.WriteLine("Code: {0}", code);
}
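
The XPath route mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the class name NumCodeXPathDemo and the helper ReadValue are made up for the example, and it assumes each value is stored either as an attribute or as a direct child element of the element being inspected.

```csharp
using System;
using System.Xml;

static class NumCodeXPathDemo
{
    // Hypothetical helper: returns the value of the attribute or child element
    // called `name`, or null when the element has neither.
    public static string ReadValue(XmlElement element, string name)
    {
        // "@name | name" is an XPath union: the attribute or the child element.
        XmlNode node = element.SelectSingleNode("@" + name + " | " + name);
        return node == null ? null : node.InnerText;
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<test>"
            + "<five code=\"012\"><num>123456789</num></five>"
            + "<six num=\"123456789\"><code>012</code></six>"
            + "</test>");

        // Each child of <test> is an element here, so the cast is safe.
        foreach (XmlElement e in doc.DocumentElement.ChildNodes)
        {
            Console.WriteLine("{0}: num={1} code={2}",
                e.Name, ReadValue(e, "num"), ReadValue(e, "code"));
        }
    }
}
```

Returning null instead of throwing is a design choice for the sketch; the caller decides whether a missing value is an error.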
Medeni Baykal
0

This seems to do the trick:

new Regex(@"(?s)<(\w+)(?=.{0,30}(<num>\s*|num="")(\d+))(?=.{0,30}(<code>\s*|code="")(\d+)).*?(/>|</\1>)")

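Groups 3 and 5 hold the num and code values, respectively. The pattern is also reasonably strict: the hard part of writing a regex is avoiding matches you don't want, since capturing what you do want is easy. Note that each .{0,30} lookahead only searches the first 30 characters after the opening tag name, so values that appear further away are not found.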
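
As a rough illustration of how the groups might be read, here is a minimal sketch that applies the pattern above, unchanged, to a few of the sample elements from the question. The class name NumCodeRegexDemo and the helper Extract are invented for this example and are not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NumCodeRegexDemo
{
    // The pattern from the answer, unchanged. Group 3 is num, group 5 is code.
    static readonly Regex Pattern = new Regex(
        @"(?s)<(\w+)(?=.{0,30}(<num>\s*|num="")(\d+))(?=.{0,30}(<code>\s*|code="")(\d+)).*?(/>|</\1>)");

    // Hypothetical helper: returns one (num, code) pair per matched element.
    public static List<Tuple<string, string>> Extract(string input)
    {
        var result = new List<Tuple<string, string>>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(Tuple.Create(m.Groups[3].Value, m.Groups[5].Value));
        }
        return result;
    }

    static void Main()
    {
        var input = "<one><num>123456789</num><code>012</code></one>\n"
                  + "<four code=\"012\" num=\"123456789\" />\n"
                  + "<five code=\"012\"><num>123456789</num></five>\n"
                  + "<seven>\n<num>123456789</num>\n<code>012</code>\n</seven>";

        foreach (var pair in Extract(input))
        {
            Console.WriteLine("num={0} code={1}", pair.Item1, pair.Item2);
        }
    }
}
```

Because the lookaheads are bounded by character count rather than by element boundaries, a short element that lacks one of the values could pick it up from the element that follows, so input that does not resemble the samples deserves a test first.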

user1096188