-1

I am trying to match elements that have no child elements but do have content. "No content" includes whitespace and &nbsp; characters. I need to do this in C#.

Take this XML for instance:

<1>
    <2><3 /></2>
    <4>
        <5>This is match 1</5>
    </4>
    <6>     
         </6>
    <7>    &nbsp;&nbsp;&nbsp;&nbsp;    &nbsp;&nbsp;&nbsp;</7>
    <8>This is match 2</8>
</1>

So only elements 5 and 8 match. The rest of the elements have child elements or only "whitespace" (spaces, tabs, carriage returns, new lines, &nbsp;).

Note

SLaks posted:

"In general, you must not parse XML using regular expressions. Instead, use the System.Xml namespace."

This unfortunately is not viable in this situation. This is an application that was not made by my team and we need to optimize it without rewriting anything (not my decision). It is invalid XML and so I need to do this in order to make it valid. Then I can treat it as xml :)

So in other words, it is a string that closely resembles XML.

This is what I have come up with so far. It accounts for everything but the "whitespace" exclusion:

  Regex ElementExpression = new Regex(
      @"<(?'tag'\w+?).*>" + // match first tag, and name it 'tag'
      @"(?'text'[^<>]*[\\S]+?)" + // match text content, name it 'text'
      @"</\k'tag'>" // match last tag, denoted by 'tag'
      , RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);
Phobis
  • yeah, eg: `      ‍ ‌`... – nickf Dec 11 '09 at 02:52
  • What makes the format invalid? What's the specification for the format in use? If it's not XML, and we don't know the specification used, it's extremely difficult to offer a useful solution... Also, why use a regex, rather than writing a simple parser? This seems much more like a parser-type problem than a regex-type problem to me... – atk Dec 11 '09 at 04:31
  • lol, there is no format. It is some junk that a software-programmer-wanna-be made. The entire software product is junk. It is unfix-able and needs to be scrapped. Unfortunately, the company I am fixing this for is in a tight spot and needs this to be patched in one day, before they work on a long term solution to do a rewrite. (it was a project done by a consulting company that doesn't even offer software-consulting, but integration consulting. Poor management decision. Then if that wasn't bad enough the project was built for 2 years with no oversight!) Anyway, I digress. – Phobis Dec 11 '09 at 05:04

6 Answers

2

In general, you must not parse XML using regular expressions.

Instead, use the System.Xml namespace.

Community
SLaks
  • This unfortunately is not viable in this situation. This is an application that was not made by my team and we need to optimize it without rewriting anything (not my decision). It is invalid XML and so I need to do this in order to make it valid. Then I can treat it as xml :) – Phobis Dec 11 '09 at 02:09
  • So I am parsing a string that very closely resembles XML. – Phobis Dec 11 '09 at 02:10
1

The regex for this will be quite cumbersome. Basically, you need a regex that looks for balanced pairs and, within each balanced pair, matches anything that is valid for your scenario. The "valid for your scenario" part is the crappy part. Given the snippet you showed, you want a regex similar to:

<(?<tag>\w*)>(?<text>.*)</\k<tag>> 

(Courtesy of Expresso)

The (?<text>.*) group is the part you will have to construct by hand to match your elimination criteria.
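As a rough illustration of what a hand-built text group could look like, here is a minimal sketch that folds the whitespace exclusion into the pattern with a negative lookahead. It assumes "whitespace" means anything matched by \s plus the literal entity &nbsp;, and that tags never contain < or > inside attribute values. The class and method names (ContentElementMatcher, FindTexts) are made up for this example and do not come from the original code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ContentElementMatcher
{
    // Opening tag, then text that is NOT only whitespace/&nbsp;, then the matching closing tag.
    static readonly Regex Pattern = new Regex(
        @"<(?<tag>\w+)\b[^<>]*>" +                // opening tag; \b keeps the whole name in 'tag'
        @"(?<text>(?!(?:\s|&nbsp;)*<)[^<>]+)" +   // reject text that is blank up to the next '<'
        @"</\k<tag>>",                            // closing tag must repeat the captured name
        RegexOptions.Compiled);

    // Returns the text of every element that has no child elements and real content.
    public static List<string> FindTexts(string input)
    {
        var result = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            result.Add(m.Groups["text"].Value);
        }
        return result;
    }
}
```

Calling ContentElementMatcher.FindTexts on the sample from the question should pick up only the text of elements 5 and 8. Whether a single pattern like this is easier to maintain than capturing everything and filtering afterwards is a judgment call.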
GrayWizardx
  • Yes... that is what I have so far... Thank you though... At least you are trying to help me solve the problem as I asked, instead of telling me another way to do it :) I just need to get the regex to exclude &nbsp; and any whitespace combined. – Phobis Dec 11 '09 at 03:20
  • You might not do that in the regex itself. If you capture each candidate and then verify it after the regex, it might be easier. Otherwise your elimination pattern is going to be very complex. I would just get all the matches either way (including invalid ones), then iterate over them and throw out the ones you don't want. – GrayWizardx Dec 11 '09 at 03:25
  • Oops saw someone posted the same solution with a down vote. Sorry. Not sure of the exact syntax for the elim pattern sorry. – GrayWizardx Dec 11 '09 at 03:26
1

I would not use regular expressions to do this! I would run it through a Tidy utility and then use XSLT and XPath.

Josh Stodola
  • That's why you use a tidying utility. The challenge is finding one that works on your particular brand of poorly-formed XML. – Robert Rossney Dec 11 '09 at 21:54
  • Or roll your own tidy utility that takes care of the specific problems you have (given that your XML isn't ridiculously malformed). It can't be that difficult... – Josh Stodola Dec 14 '09 at 01:24
0

I was able to get what I wanted by using one regex to get the elements and a second regex to remove the ones with the whitespace I defined.

With about 30MB of data it takes 3 seconds.

  // Requires: using System; using System.Text.RegularExpressions; using System.Web; (for HttpUtility)
  Regex ElementExpression = new Regex(
      @"<(?'tag'\w+?)(?'attributes'.*?)>" + // opening tag: name as 'tag', the rest as 'attributes'
      @"(?'text'[^<>]*?)" +                 // text content with no nested markup, named 'text'
      @"</\k'tag'>",                        // closing tag must repeat the captured 'tag'
      RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);

  // Matches a string made up only of whitespace and &nbsp; entities (\s already covers \r).
  Regex WhiteSpaceExpression = new Regex(
      @"\A(?:&nbsp;|\s)*\Z",
      RegexOptions.Multiline | RegexOptions.Compiled | RegexOptions.IgnoreCase);

  text = ElementExpression.Replace(text, delegate(Match match)
  {
      string tag = match.Groups["tag"].Value;
      string attributes = match.Groups["attributes"].Value;
      string content = match.Groups["text"].Value;

      if (!WhiteSpaceExpression.IsMatch(content))
      {
          // Real content: keep the element and HTML-encode its text.
          return String.Format("<{0}{1}>{2}</{0}>", tag, attributes, HttpUtility.HtmlEncode(content));
      }

      // Only whitespace or &nbsp;: collapse to a self-closing element.
      return String.Format("<{0}{1} />", tag, attributes);
  });
Phobis
  • Again, I want to make clear that this is a horrible thing to do and I know that, but it is the scope of the task I have at hand. This is outside code that needed to be optimized while working the same way. (The previous code was done client side in JavaScript and would bomb after about an hour!) ...You get what you pay for, and the company that paid for this paid a non-software consulting company to build this junk. – Phobis Dec 11 '09 at 04:58
0

If it's not XML, that's bad. Saying that it's a "string that closely resembles XML" is not really an adequate definition of the problem. There are infinitely many ways for a string to closely resemble XML, and a parsing solution devised for one won't work with another.

If you can be specific about the ways in which the string will deviate from XML - i.e., if you can identify the specific mistakes that the original developer was making in attempting to write XML - it should be possible to undo the damage, turn the string into well-formed XML, and then use a DOM approach to find the data that you're looking for.

If you can't be specific about the ways in which the string deviates from XML, then you have a much bigger problem than writing a regular expression.
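To make the "repair first, then use a DOM approach" idea concrete, here is a minimal sketch of the second half using LINQ to XML. It assumes the repair step has already produced well-formed XML, which in particular means element names no longer start with a digit and every &nbsp; has been rewritten as &#160; (the XML parser does not know the &nbsp; entity). The names LeafFinder and FindContentLeaves are hypothetical.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class LeafFinder
{
    // Returns the names of elements that have no child elements and whose text
    // is not blank. U+00A0 (the character behind &#160;) counts as whitespace
    // for string.IsNullOrWhiteSpace, so nbsp-only elements are excluded too.
    public static string[] FindContentLeaves(string xml)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);
        return doc.Descendants()
                  .Where(e => !e.HasElements && !string.IsNullOrWhiteSpace(e.Value))
                  .Select(e => e.Name.LocalName)
                  .ToArray();
    }
}
```

With a repaired version of the sample such as `<root><a><b /></a><c><d>This is match 1</d></c><e>   </e><f> &#160;&#160; </f><g>This is match 2</g></root>`, the method should return only d and g. The hard part remains the repair step itself, which depends on exactly how the input deviates from XML.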

Robert Rossney
-1

I would approach it in two passes (in Perl, but the regexes should translate).

First pass. Extract all strings.

my @strings = $s =~ m{<[^>]+>([^<>]+)<[^/>]*/[^/>]*>}g;

Second pass. Filter out the strings that consist only of whitespace and &nbsp;.

@strings = grep { !/^(?:&nbsp;|\s)+$/ } @strings;
zen