I have a string like:

[a b="c" d="e"]Some multi line text[/a]

The d="e" part is optional. I want to convert such a string into:

<a b="c" d="e">Some multi line text</a>

The names a, b and d are constant, so I don't need to capture them. I only need the values c and e and the text between the tags, from which to build the equivalent XML. How can I do that, given that part of the input is optional?

Brad Mace
Priyank Bolia
  • I assume there might be cases like: `[a b="" d="e"]Some here too[/a]` ? – o.k.w Nov 28 '09 at 08:25
  • Anything can be in between the [a..]..[/a] – Priyank Bolia Nov 28 '09 at 08:35
  • If "anything can be in between [a..]..[/a]" then you will generate an infinitely large regex which will be infinitely broken – peter.murray.rust Nov 28 '09 at 08:37
  • I don't think so, as there won't be any [/a] in the text; even if it's present, I need to match up to the first closing [/a] – Priyank Bolia Nov 28 '09 at 08:40
  • You have said "the text can be anything" and "there can't be HTML markup or scripts per se". So you have an undefined/contradictory problem and it's impossible to write a regex. Only you know what the grammar - if there is one - is for your multiline context. Only you can write a parser. And regexes will almost certainly lead to problems – peter.murray.rust Nov 28 '09 at 08:51
  • Do you notice that your problem is horribly underspecified? Even with the amendments in the comments, there are a lot of dangerous assumptions and undefined special cases hanging around, ready to cause trouble. And while I *think* (after the current problem description) that this could be solved by a regular expression, it might *still* not be a good idea, what with changing requirements and updating the code to reflect them. – Konrad Rudolph Nov 28 '09 at 09:15
  • Assume your "anything" contains `bookseller="Barnes&Noble"`. Then all the regexes proposed so far will generate broken XML, because XML is more complex than you assume – peter.murray.rust Nov 28 '09 at 09:17

3 Answers


If you are actually thinking of processing (pseudo)-HTML using regexes,

don't

SO is filled with posts where regexes are proposed for HTML/XML and answers pointing out why this is a bad idea.

Suppose your multiline text ("which can be anything") contains

[a b="foo" [a b="bar"]]

A regex cannot reliably detect this kind of nested bracket.

See the classic answer in: RegEx match open tags except XHTML self-contained tags

which has:

I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death. – bobince

Seriously. Find an XML or HTML DOM and populate it with your data. Then serialize it. That will take care of all the problems you don't even know you have got.
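As a rough illustration of the "populate a DOM, then serialize" advice, here is a minimal C# sketch using LINQ to XML. The class `DomExample` and the helper `BuildTag` are hypothetical names made up for this example, and it assumes the values of b, d and the text have already been extracted by some other means. The point is only that the serializer, not your own string concatenation, takes care of escaping.

```csharp
using System;
using System.Xml.Linq;

class DomExample
{
    // Hypothetical helper: builds <a b="..." [d="..."]>text</a> through LINQ to XML.
    public static string BuildTag(string b, string d, string text)
    {
        var element = new XElement("a", new XAttribute("b", b));
        if (d != null)
        {
            // d is optional in the question, so it is added only when present.
            element.Add(new XAttribute("d", d));
        }
        element.Add(new XText(text));
        // DisableFormatting keeps the text as given instead of re-indenting it.
        return element.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // The '&' in the attribute and the '<' in the text are escaped on output.
        Console.WriteLine(BuildTag("Barnes&Noble", null, "1 < 2"));
    }
}
```

With this approach a value such as Barnes&Noble is written as Barnes&amp;Noble, so the result stays well-formed XML.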

peter.murray.rust

Would the multi-line text include [ and ]? If not, you can just replace [ with < and ] with > using string.replace; no need for a regex.

Update: If it can be anything but [/a], you can replace

^\[a([^\]]+)](.*?)\[/a]$

with

<a$1>$2</a>

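I haven't escaped ] and / in the regex. Since the text spans several lines, also enable the option that lets . match newlines (single-line/dotall mode). If your regex flavor requires it, escape ] and / to get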

^\[a([^\]]+)\](.*?)\[\/a\]$
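In case it helps, here is a small C# sketch of that replacement, assuming .NET's `Regex` class (in .NET only the brackets need escaping, so `/` is left as-is). The class name `BracketToAngle` and the method `Convert` are just illustrative names. Note that `RegexOptions.Singleline` is what lets `.` match newlines, and that with the `^`/`$` anchors the whole input has to be a single tag.

```csharp
using System;
using System.Text.RegularExpressions;

class BracketToAngle
{
    // Same pattern as in the answer; Singleline makes '.' also match '\n'.
    static readonly Regex Tag =
        new Regex(@"^\[a([^\]]+)\](.*?)\[/a\]$", RegexOptions.Singleline);

    public static string Convert(string input)
    {
        // $1 = the attribute list (leading space included), $2 = the text.
        return Tag.Replace(input, "<a$1>$2</a>");
    }

    static void Main()
    {
        string input = "[a b=\"c\" d=\"e\"]Some multi\nline text[/a]";
        Console.WriteLine(Convert(input));
    }
}
```

This copies the attributes and the text through unchanged, so it does nothing about characters like & or < inside them.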
Amarghosh

For HTML tags, please use an HTML parser.

For [a]...[/a], you can do the following:

// Singleline (not Multiline) is the option that lets '.' match newlines in the text.
Match m = Regex.Match(@"[a b=""c"" d=""e""]Some multi line text[/a]",
                      @"\[a b=""([^""]+)"" d=""([^""]+)""\](.*?)\[/a\]",
                      RegexOptions.Singleline);

m.Groups[1].Value   // "c"
m.Groups[2].Value   // "e"
m.Groups[3].Value   // "Some multi line text"

Here is the Regex.Replace version (though I don't prefer it):

string inputStr = @"[a b=""[[[[c]]]]"" d=""e[]""]Some multi line text[/a]";
// The d attribute is optional; Singleline lets '.' match newlines in the text.
string resultStr = Regex.Replace(inputStr,
                            @"\[a( b=""[^""]+"")( d=""[^""]+"")?\](.*?)\[/a\]",
                            @"<a$1$2>$3</a>",
                            RegexOptions.Singleline);
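If you only want the attribute values captured and d emitted only when it is present, one option is a MatchEvaluator passed to Regex.Replace. The following is a sketch under a few assumptions of mine, not a definitive implementation: the names `TagConverter` and `Convert` are made up, `[^""]*` is used so that empty values like b="" are accepted, and `SecurityElement.Escape` is used to escape &, <, > and quotes in the captured values.

```csharp
using System;
using System.Security;
using System.Text.RegularExpressions;

class TagConverter
{
    // Groups: 1 = value of b, 2 = value of d (optional), 3 = text between the tags.
    static readonly Regex Tag = new Regex(
        @"\[a b=""([^""]*)""(?: d=""([^""]*)"")?\](.*?)\[/a\]",
        RegexOptions.Singleline);

    public static string Convert(string input)
    {
        // The lambda is converted to a MatchEvaluator delegate.
        return Tag.Replace(input, m =>
        {
            string b = SecurityElement.Escape(m.Groups[1].Value);
            // Emit d only when the optional group took part in the match.
            string d = m.Groups[2].Success
                ? " d=\"" + SecurityElement.Escape(m.Groups[2].Value) + "\""
                : "";
            string text = SecurityElement.Escape(m.Groups[3].Value);
            return "<a b=\"" + b + "\"" + d + ">" + text + "</a>";
        });
    }

    static void Main()
    {
        Console.WriteLine(Convert("[a b=\"c\" d=\"e\"]Some multi\nline text[/a]"));
        Console.WriteLine(Convert("[a b=\"Barnes&Noble\"]no d here[/a]"));
    }
}
```

Because the lazy `(.*?)` stops at the first [/a], each tag in a longer input is converted on its own; text outside the tags is left untouched.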
YOU
  • First of all, I am not parsing HTML; it's text with some tags that need to be converted to XML. Second, is there a direct way, such as using the Regex.Replace function? – Priyank Bolia Nov 28 '09 at 08:39
  • You missed part of the question: the d="e" part is optional. I guess your Regex.Replace won't work. – Priyank Bolia Nov 28 '09 at 08:51
  • If $2 doesn't contain anything, then there shouldn't be a d="$2" in the output. – Priyank Bolia Nov 28 '09 at 08:58
  • The answer is not quite what I was looking for, as it matches everything instead of just the attribute values. I figured it out: the best way is to use the Match and a MatchEvaluator delegate in the Regex.Replace method. Accepting your answer, as it was the most helpful. – Priyank Bolia Nov 28 '09 at 10:20