Capture two blocks in a string

Question

I have a string that's in this format:

Message: Something bad happened in This.Place < Description> Some sort of information here< /Description>< Error> Some other stuff< /Error>< Message> Some message here.

I can't seem to figure out how to match everything in the Description block and also everything in the Message block using regex.

My question is in two parts: 1.) Is regex the right choice for this? 2.) If so, how can I match those two blocks and exclude the rest?

I can match the first part with a simple < Description>.*< /Description>, but can't match < Message>. I've tried excluding everything in between using the approach described here: http://blog.codinghorror.com/excluding-matches-with-regular-expressions/

[You shouldn't be trying to parse XML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). You should be using an XML parser specifically designed for parsing XML. — Servy, Jun 10 '14 at 20:07
You could try Regex, but if your string is composed of tags it might be a good idea to just parse everything using XmlDocument and then traverse the nodes to obtain what you need; or just deserialize the string to an object, so that you can access your data like `message.Description` — Lucian, Jun 10 '14 at 20:08
@Servy Not XML, just uses angle brackets to label blocks in an error log. — Michael Bowman, Jun 10 '14 at 20:11
@Servy markup language. What I'm receiving this data from is not an XML document. — Michael Bowman, Jun 10 '14 at 20:12
@MichaelBowman But it's XML data, regardless of where you got it. — Servy, Jun 10 '14 at 20:13
Don't use the markup, just use a pipe delimiter and use string.split. — Bill Sempf, Jun 10 '14 at 20:23
C# and .NET have facilities specifically designed to read XML. Regex is the difficult, mind-bending, brittle, error-prone, reinventing-the-wheel way of doing it. — Robert Harvey, Jun 10 '14 at 20:24
I'm not sure that C# will spawn an XmlDocument object for something that has open text at the beginning. Perhaps trim off everything before the first open angle bracket, since you don't need that, then open an XmlDocument? — Bill Sempf, Jun 10 '14 at 20:35
... and of course you could always do it the old fashioned way using `string.IndexOf` and `string.Substring` to parse the string manually :) — Lucian, Jun 10 '14 at 20:41
No harm in that either. Regex is good for a LOT of things, but not parsing markup. That's just not what it was designed for. — Bill Sempf, Jun 10 '14 at 20:42
@Lucian in my case, it was easier to use IndexOf and Substring than to try to remove the non-XML information and try to parse it as XML. Thanks for the suggestion - it works quite well. — Michael Bowman, Jun 18 '14 at 21:04

score 0 · Answer 1 · answered Jun 10 '14 at 21:50

With all the disclaimers about parsing XML with regex, it's still good to know how to do this with regex.

For instance, if you had your back against the wall, this would work for the < Description> tag (adapt it for the other tag).

(?<=< Description>).*?(?=< /Description>)

Some things you need to know:

The (?<=< Description>) is a lookbehind that asserts that at that position in the string, what precedes is < Description>. So if you change the spaces in your tag, all bets are off. To handle potential typing errors (depending on the origin of your text), you can insert optional spaces: (?<=< *Description *>) where the * repeats the space character zero or more times. The lookbehind is only an assertion, it does not consume any characters.
The .*? lazily eats up all characters until it can find what follows...
Which is the (?=< /Description>) lookahead that asserts that at that position in the string, what follows is < /Description>

In code, this becomes something like:

description = Regex.Match(yourstring, "(?<=< *Description *>).*?(?=< */Description *>)").Value;
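To show what "adapt it for the other tag" could look like, here is a minimal sketch. It assumes the input looks exactly like the sample in the question, where < Message> is never closed, so the second pattern simply runs to the end of the line. The class and method names (`LookaroundDemo`, `GetDescription`, `GetMessage`) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Text between the opening and closing Description tags;
    // spaces inside the tags are optional.
    public static string GetDescription(string input)
    {
        return Regex.Match(input, "(?<=< *Description *>).*?(?=< */Description *>)").Value;
    }

    // Assumption: as in the question's sample, the Message tag has no closing
    // counterpart, so everything after the opening tag up to the end of the
    // line is taken.
    public static string GetMessage(string input)
    {
        return Regex.Match(input, "(?<=< *Message *>).*").Value;
    }

    public static void Main()
    {
        string input = "Message: Something bad happened in This.Place < Description> Some sort of information here"
                     + "< /Description>< Error> Some other stuff< /Error>< Message> Some message here.";

        // Brackets make the leading space in each captured value visible.
        Console.WriteLine("[" + GetDescription(input) + "]");
        Console.WriteLine("[" + GetMessage(input) + "]");
    }
}
```

Both values keep the space that follows the opening tag; call `Trim()` on them if that is unwanted. When nothing matches, `Regex.Match(...).Value` is an empty string rather than null.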

Sorry, I hadn't been on for a couple days since. I'd figured out a solution that utilized IndexOf and Substring. It seems to work fine, so I'm going with it for now. — Michael Bowman, Jun 18 '14 at 21:03

Darryl · Answer 2 · 2014-06-11T02:32:11.653

This is how I'd parse it. Caveat: I've written the regex assuming the format shown in the example you've provided is pretty rigid; if the data varies a little (say, there isn't always a space after the '<' characters), you'll need to tweak it a little. But this should get you going.

var text = "Message: Something bad happened in This.Place < Description> Some"+
           " sort of information here< /Description>< Error> Some other stuff"+
           "< /Error>< Message> Some message here.";

var regex = new Regex(
      "^.*?<\\sDescription\\>(?<description>.*?)<\\s/Description\\>"+
      ".*?<\\sMessage\\>(?<message>.*?)$",
      RegexOptions.IgnoreCase | RegexOptions.Singleline
    );

var matches = regex.Match(text);

if (matches.Success) {
    var desc = matches.Groups["description"].Value;
    // " Some sort of information here"

    var msg = matches.Groups["message"].Value;
    // " Some message here."
}
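If the spacing inside the tags varies, one possible tweak is to replace each `\s` with `\s*`. The sketch below does that and also accepts an optional closing Message tag; both changes are assumptions about how the data might vary, not something the question confirms. `TolerantPattern` and `TryParse` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TolerantPattern
{
    // \s* allows zero or more whitespace characters wherever the sample shows a space.
    // The message runs to a closing Message tag if there is one, otherwise to the end.
    private static readonly Regex BlockRegex = new Regex(
        @"<\s*Description\s*>(?<description>.*?)<\s*/\s*Description\s*>" +
        @".*?<\s*Message\s*>(?<message>.*?)(?:<\s*/\s*Message\s*>|$)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns false (and null outputs) when the text does not contain both blocks.
    public static bool TryParse(string text, out string description, out string message)
    {
        Match m = BlockRegex.Match(text);
        description = m.Success ? m.Groups["description"].Value.Trim() : null;
        message = m.Success ? m.Groups["message"].Value.Trim() : null;
        return m.Success;
    }

    public static void Main()
    {
        string text = "Message: Something bad happened in This.Place < Description> Some sort of information here"
                    + "< /Description>< Error> Some other stuff< /Error>< Message> Some message here.";

        if (TryParse(text, out string description, out string message))
        {
            Console.WriteLine(description);
            Console.WriteLine(message);
        }
    }
}
```

The values are trimmed here, so the leading space seen in the original groups is dropped.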

score 0 · Accepted Answer · answered Jun 18 '14 at 21:26

It was fairly difficult to try to remove the non-XML-formatted data from the text, so IndexOf and Substring ended up being what I used. IndexOf will find the index of a specified character or string, and Substring captures characters based on a starting point and a count of how many it should capture.

foreach (string j in errorList)
{
    // IndexOf returns -1 when a tag is missing, so these offsets are only valid if all four tags are present.
    int descriptionBegin = j.IndexOf("<Description>") + 13; // first char after the opening tag (13 = tag length)
    int descriptionEnd = j.IndexOf("</Description>"); // index of the closing tag, i.e. one past the last content char
    int messageBegin = j.IndexOf("<Message>") + 9; // first char after the opening tag (9 = tag length)
    int messageEnd = j.IndexOf("</Message>"); // index of the closing tag, i.e. one past the last content char
    int descriptionDiff = descriptionEnd - descriptionBegin; // number of chars between the tags
    int messageDiff = messageEnd - messageBegin; // number of chars between the tags
    string description = j.Substring(descriptionBegin, descriptionDiff); // grabs only that many chars
    string message = j.Substring(messageBegin, messageDiff); // grabs only that many chars
}
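The same IndexOf/Substring idea can be wrapped in a small helper that checks for missing tags instead of assuming they are always present. This is only a sketch: `TagExtractor`, `Between`, and the sample line in `Main` are hypothetical and not taken from the real log data.

```csharp
using System;

public static class TagExtractor
{
    // Returns the text between openTag and closeTag, or null when either tag is missing.
    public static string Between(string source, string openTag, string closeTag)
    {
        int open = source.IndexOf(openTag, StringComparison.Ordinal);
        if (open < 0) return null;

        // First character after the opening tag.
        int begin = open + openTag.Length;

        // Search for the closing tag only after the opening tag.
        int end = source.IndexOf(closeTag, begin, StringComparison.Ordinal);
        if (end < 0) return null;

        // Substring takes a start index and a length, not an end index.
        return source.Substring(begin, end - begin);
    }

    public static void Main()
    {
        // Made-up input in the tag format the loop above expects.
        string line = "Oops <Description>Disk full</Description><Message>Retry later</Message>";

        Console.WriteLine(Between(line, "<Description>", "</Description>"));
        Console.WriteLine(Between(line, "<Message>", "</Message>"));
    }
}
```

Using `openTag.Length` avoids hard-coding 13 and 9, and returning null makes a missing tag visible to the caller instead of causing an exception inside `Substring`.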

Thanks @Lucian for the suggestion. @Darryl that actually looks like it might work. Thanks for the thorough answer... I might try that out for other stuff in the future (non-XML of course :))
